Update 2019-03-13
This is now part of a series of blog posts about scc (Sloc Cloc and Code), which has since been optimised to be the fastest code counter for almost every workload. Read more about it at the following links.
- Sloc Cloc and Code - What happened on the way to faster Cloc 2018-04-16
- Sloc Cloc and Code Revisited - A focus on accuracy 2018-08-28
- Sloc Cloc and Code Revisited - Optimizing an already fast Go application 2018-09-19
- Sloc Cloc and Code a Performance Update 2019-01-09
- Sloc Cloc and Code Badges for Github/Bitbucket/Gitlab
Two things prompted me to start looking at my code counter scc again. The first was the release of Go 1.11. New releases of compilers, libraries and tool-chains have a wonderful habit of making things go faster without you having to do anything other than recompile. In addition they often provide new methods which assist with this and are worth exploring.
The other was that the author of tokei released a new update, v8.0.0, and included a comparison to scc on the project page https://github.com/Aaronepower/tokei/blob/master/COMPARISON.md
I had been tracking the improvements in tokei, loc and polyglot over the last few weeks. However what really surprised me were the accuracy issues pointed out, particularly the fact that scc version 1.7.0 was misreporting the number of lines.
Denial: Step one of software debugging.
I tried testing it out on the example provided by tokei in the comparison page https://github.com/Aaronepower/tokei/blob/master/COMPARISON.md
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks Complexity
-------------------------------------------------------------------------------
Rust                         1        33       28         1        4          5
-------------------------------------------------------------------------------
Total                        1        33       28         1        4          5
-------------------------------------------------------------------------------
Wow. It really does misreport the number of lines. There should be 39 there.
This disturbed me quite a lot. In fact I had even written tests in scc to ensure I got the number of lines correct. For example this one,
content := ""
for i := 0; i < 5000; i++ {
	// Each pass appends one more line, so the reported count must equal i+1.
	content += "a\n"
	fileJob.Lines = 0
	fileJob.Content = []byte(content)
	CountStats(&fileJob)
	if fileJob.Lines != int64(i+1) {
		t.Errorf("Expected %d got %d", i+1, fileJob.Lines)
	}
}
That should never happen.
Bargaining/Self-Blame: Stage two of software debugging.
Time to go code spelunking. Since I wrote scc and it's a fairly small code-base, I had a feeling it was an issue to do with the skip-ahead logic. When scc finds a matching condition it keeps the offset around so it can jump ahead. The idea is that we skip bytes we have already looked at where possible, if we know they matched a condition which changed the state. However if there was an error in this logic it's possible it would jump over any newlines \n, which are used to determine the total count.
The offending code I had in mind was this.
// If we checked ahead on bytes we are able to jump ahead and save some time reprocessing
// the same values again
index += offsetJump
Just commenting this out got scc to report the correct number of lines. Ouch. Turns out I made a boo-boo. That was rather stupid of me.
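To make the failure mode concrete, here is a minimal sketch of how a skip-ahead that overshoots by a single byte can swallow a newline. It is not the scc source: countLines, token and jump are names made up for this illustration, and the real state machine is more involved.

```go
package main

import (
	"bytes"
	"fmt"
)

// countLines walks content one byte at a time and counts newlines.
// When it finds token it jumps ahead by jump bytes, mimicking a
// skip-ahead optimisation. (Hypothetical helper, not scc's code.)
func countLines(content, token []byte, jump int) int {
	lines := 0
	for i := 0; i < len(content); i++ {
		if content[i] == '\n' {
			lines++
			continue
		}
		if bytes.HasPrefix(content[i:], token) {
			// The loop's own i++ also moves one byte forward, so the
			// total movement here is jump+1 bytes.
			i += jump
		}
	}
	return lines
}

func main() {
	content := []byte("a //\nb\nc\n")
	token := []byte("//")

	// Jumping by len(token)-1 lands on the last byte of the token, so
	// the newline that follows is still inspected.
	fmt.Println(countLines(content, token, len(token)-1)) // prints 3

	// Jumping by len(token) overshoots by one byte and skips the newline.
	fmt.Println(countLines(content, token, len(token))) // prints 2
}
```

The off-by-one only bites when a newline sits directly after the matched token, which is one plausible reason a test built from plain "a\n" lines would never trip over it.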
I still had issues with the rest of the stats but was happy with progress. I then turned my attention to the tokei test suite and the simpler examples it had to verify correct output. The author of tokei suggested just using the test suite from tokei https://www.reddit.com/r/rust/comments/99e4tq/reading_files_quickly_in_rust/ so it seemed like a good idea. Also how could I be so stupid as to introduce this bug from day one and not notice it?
Anger: Stage three of software debugging.
I turned my attention to this example from the tokei code-base written in Java.
/* 23 lines 16 code 4 comments 3 blanks */
/*
 * Simple test class
 */
public class Test
{
    int j = 0; // Not counted
    public static void main(String[] args)
    {
        Foo f = new Foo();
        f.bar();
    }
}
class Foo
{
    public void bar()
    {
        System.out.println("FooBar"); //Not counted
    }
}
Getting the stats from our now correct scc (for line counts anyway).
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks Complexity
-------------------------------------------------------------------------------
Java                         1        23       18         2        3          0
-------------------------------------------------------------------------------
Total                        1        23       18         2        3          0
-------------------------------------------------------------------------------
Well that's not brilliant. The only things scc got right were the number of files and the number of lines. Maybe if I tweak it a little bit I can resolve this issue and everything else will go away? In any case how in the heck could I never have noticed this? I knew that the edge cases are a bitch to deal with, but still…
Depression: Stage four of software debugging.
Looking into it, the issues still appeared to be related to the end of line comments. When I first implemented scc I set a special state at the end of closing multi-line comments. This would allow it to fall back into the code state when it hit a newline. However the result of this is that I introduced a bug: when there was a multi-line comment, its last line would be counted as code. I never caught it because, looking through my own projects, I don’t use multi-line comments most of the time.
In reality what I should have done (which seems obvious in hindsight) is never process whitespace characters, unless they are a \n newline, which resets the state and counts the line as whatever state the application is in. When I realized this I was rather depressed that it took me so long to work this out.
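That rule can be sketched as a tiny line classifier. This is an illustration under loud assumptions rather than the scc implementation: the counts type and count function are invented names, only // and /* */ comments are recognised, and string literals are ignored entirely, which a real counter cannot get away with.

```go
package main

import (
	"fmt"
	"strings"
)

// counts holds per-file totals. (Hypothetical type for this sketch.)
type counts struct {
	Lines, Code, Comments, Blanks int
}

// count classifies every line of src as code, comment or blank.
// Whitespace never touches the state; only '\n' closes out a line.
func count(src string) counts {
	var c counts
	inMulti := false    // inside /* ... */, carried across lines
	sawCode := false    // current line contains code
	sawComment := false // current line contains comment text

	// flush records the current line and resets the per-line flags.
	flush := func() {
		c.Lines++
		switch {
		case sawCode:
			c.Code++
		case sawComment:
			c.Comments++
		default:
			c.Blanks++
		}
		sawCode, sawComment = false, false
	}

	for i := 0; i < len(src); i++ {
		ch := src[i]
		switch {
		case ch == '\n':
			flush()
		case ch == ' ' || ch == '\t' || ch == '\r':
			// Whitespace is skipped and never changes the state.
		case inMulti:
			sawComment = true
			if strings.HasPrefix(src[i:], "*/") {
				inMulti = false
				i++ // consume the '/' of the closing token
			}
		case strings.HasPrefix(src[i:], "/*"):
			sawComment = true
			inMulti = true
			i++ // consume the '*' of the opening token
		case strings.HasPrefix(src[i:], "//"):
			sawComment = true
			// Stop just before the newline so it is still counted.
			for i+1 < len(src) && src[i+1] != '\n' {
				i++
			}
		default:
			sawCode = true
		}
	}
	if len(src) > 0 && src[len(src)-1] != '\n' {
		flush() // last line without a trailing newline
	}
	return c
}

func main() {
	src := "/*\n * Simple test class\n */\npublic class Test\n{\n    int j = 0; // Not counted\n\n}\n"
	// 8 lines: 4 code, 3 comments, 1 blank.
	fmt.Printf("%+v\n", count(src))
}
```

Because the closing " */" line is only ever judged when its newline arrives, it is counted as a comment rather than falling back into code.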
Acceptance: Stage five of software debugging.
A quick change to resolve the above, never processing whitespace characters, and all of a sudden everything was working as it should.
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks Complexity
-------------------------------------------------------------------------------
Java                         1        23       16         4        3          0
-------------------------------------------------------------------------------
Total                        1        23       16         4        3          0
-------------------------------------------------------------------------------
In fact running over the tokei samples everything worked (with one exception covered later). So I had a look again at the torture test posted.
$ scc
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks Complexity
-------------------------------------------------------------------------------
Rust                         1        38       29         5        4          5
-------------------------------------------------------------------------------
Total                        1        38       29         5        4          5
-------------------------------------------------------------------------------
A much better result. However it still is not accurate, nor does it match tokei, which produces the following. (BTW I am not a fan of the new full-width result tokei now produces; it made it hard to get the below close to the above in terms of matching width.)
$ tokei
--------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks
--------------------------------------------------------------------------------
Rust                         1        38       32         2        4
--------------------------------------------------------------------------------
Total                        1        38       32         2        4
--------------------------------------------------------------------------------
What’s the difference? One thing when looking at the source that caught my eye was the following,
let this_does_not = /* a /* nested */ comment " */
Nested comments? In fact I remember looking into this when I first wrote scc. I was wondering about nested multi-line comments, which turned out to be a compile error in Java. Hence, while I toyed with getting them working, I figured that was not a brilliant idea and explicitly made it work without them.
So the reason for the difference is that tokei has some sort of stack for dealing with nested comments so it knows when to finish with them. I didn’t even know that was a thing.
Playing around with Rust, it turns out that it DOES support nested comments. My first thought was that this implementation is a bad idea. For example if you write the following /*/**/ that is going to break tokei as everything will be a comment. Trying it out happens to be a compiler error… so it is not a case worth worrying about. If however you did happen to half implement a nested comment you get the following (I added it to the first line),
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks
-------------------------------------------------------------------------------
Rust                         1        38        0        34        4
-------------------------------------------------------------------------------
Total                        1        38        0        34        4
-------------------------------------------------------------------------------
Clearly the above is wrong, but then again so is the code as it will not compile. I have no idea if other languages will allow the above state. Also if you are reading this and know why you would even want nested comments please let me know. I cannot think of a good reason to implement them other than it's a neat trick to put into your language.
Side note, this is why it is a good idea to at least toy around with other languages. It gives you greater perspective. Before I started my Rust journey I would have insisted that no mainstream language supports nested multi-line comments. Always be learning.
Acceptance: Stage five of software debugging.
Well knowing what is wrong is the second step to fixing it, with the first being knowing something is wrong. Clearly I underestimated how devious language designers can be.
Fixing this isn’t a huge issue. I just need to keep a stack of the multi-line comment opens, and check for another one while inside a comment. Sadly during this process I noticed that scc was missing quite a few edge cases. Thankfully the tokei stress test is pretty brutal and allowed me to identify them all and resolve them.
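As a rough sketch of the idea, and not the code that went into scc, the following tracks how many comments are currently open. Since only one comment style (/* */) is handled, a depth counter stands in for the stack. The stripComments name and the nested flag are invented for this example, and string literals are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// stripComments returns the text of src that lies outside /* */
// comments, plus the number of comments still open at the end.
// With nested set, a "/*" inside a comment opens another level that
// needs its own "*/". (Hypothetical helper for illustration only.)
func stripComments(src string, nested bool) (string, int) {
	var code strings.Builder
	depth := 0
	for i := 0; i < len(src); {
		rest := src[i:]
		switch {
		case strings.HasPrefix(rest, "/*") && (depth == 0 || nested):
			depth++
			i += 2
		case depth > 0 && strings.HasPrefix(rest, "*/"):
			depth--
			i += 2
		default:
			if depth == 0 {
				code.WriteByte(src[i]) // outside any comment
			}
			i++
		}
	}
	return code.String(), depth
}

func main() {
	line := `let x = /* a /* nested */ comment */ 1;`
	for _, nested := range []bool{true, false} {
		code, depth := stripComments(line, nested)
		fmt.Printf("nested=%t code=%q open=%d\n", nested, code, depth)
	}

	// A half-open nested comment leaves one level open, so everything
	// after it is treated as a comment.
	_, depth := stripComments("/*/**/ fn main() {}", true)
	fmt.Println(depth)
}
```

With nesting on, the whole comment is removed and only `let x =  1;` remains. With nesting off, the comment ends at the first `*/`, so ` comment */` leaks back out as code, which is the behaviour you would want for a language like Java where nesting is not supported.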
After much tweaking and fiddling with the logic.
$ tokei
--------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks
--------------------------------------------------------------------------------
Rust                         1        38       32         2        4
--------------------------------------------------------------------------------
Total                        1        38       32         2        4
--------------------------------------------------------------------------------
$ scc
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks Complexity
-------------------------------------------------------------------------------
Rust                         1        38       32         2        4          5
-------------------------------------------------------------------------------
Total                        1        38       32         2        4          5
-------------------------------------------------------------------------------
Excellent.
However what price has tokei paid for this logic? For example, is it intelligent enough to know that Java does not support nested multi-line comments? Turns out it is. It also turns out that nested multi-line comments are more common across languages than I expected: Lisp, Rust, Lean, Jai, Idris, Scheme, Swift, Julia and Kotlin all support them. As such I added in the same checks to ensure that scc is as accurate as tokei, unless of course there are differences in the languages.json file that both use.
I tried then running across the full suite of tokei tests,
$ scc -c -co -s name tests
-------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks
-------------------------------------------------------------------------------
C++                          1        15        7         4        4
C++ Header                   1        21       11         5        5
CMake                        1        25       16         3        6
Cogent                       1         7        2         2        3
Crystal                      1        20       14         2        4
D                            1         8        2         5        1
Dockerfile                   1        16        6         3        7
Emacs Dev Env                1        16        6         7        3
Emacs Lisp                   1        21       11         6        4
F#                           1        13        4         5        4
F*                           1        10        3         4        3
HTML                         1        27       15         8        4
Java                         1        23       16         4        3
MSBuild                      1        12       10         1        1
Makefile                     1        24       11         5        8
Meson                        1        12        6         2        4
Module-Definition            1        17        9         6        2
Org                          1        13        7         2        4
QML                          1        20       11         5        4
Rakefile                     1        10        4         2        4
Ruby                         1        20        9         8        3
Rust                         1        39       32         2        5
SRecode Template             1        37       23         2       12
Scheme                       1        26       14         4        8
Scons                        1        10        3         3        4
Ur/Web                       2        17        9         5        3
Ur/Web Project               1         3        1         1        1
VHDL                         1        30       20         4        6
Visual Basic                 1         7        4         2        1
Xtend                        1        23       13         4        6
-------------------------------------------------------------------------------
Total                       31       542      299       116      127
-------------------------------------------------------------------------------
$ tokei tests
--------------------------------------------------------------------------------
Language                 Files     Lines     Code  Comments   Blanks
--------------------------------------------------------------------------------
CMake                        1        25       16         3        6
Cogent                       1         7        2         2        3
C++                          1        15        7         4        4
C++ Header                   1        21       11         5        5
Crystal                      1        20       14         2        4
D                            1         8        5         1        2
Dockerfile                   1        16        6         3        7
Emacs Lisp                   1        21       11         6        4
Emacs Dev Env                1        16        6         7        3
F#                           1        13        5         4        4
F*                           1        10        3         4        3
HTML                         1        27       15         8        4
Java                         1        23       16         4        3
Makefile                     1        24       11         5        8
Meson                        1        12        6         2        4
Module-Definition            1        17        9         6        2
MSBuild                      1        12       10         1        1
Org                          1        13        7         2        4
QML                          1        20       11         5        4
Rakefile                     1        10        4         2        4
Ruby                         1        20        9         8        3
Rust                         1        39       32         2        5
SRecode Template             1        37       23         2       12
Scheme                       1        26       14         4        8
Scons                        1        10        3         3        4
Ur/Web                       2        17        9         5        3
Ur/Web Project               1         3        1         1        1
VHDL                         1        30       20         4        6
Visual Basic                 1         7        4         2        1
Xtend                        1        23       13         4        6
--------------------------------------------------------------------------------
Total                       31       542      303       111      128
--------------------------------------------------------------------------------
The differences in the stats are mostly down to how the language D works (the F# row also differs by one line between code and comments). I have a bug tracked for this https://github.com/boyter/scc/issues/27 to be resolved at some point in the future. Annoying, but not worrying enough at this point to spend too much time on it.
With what appears to be most of the bugs ironed out, it is time to look at performance again. With the changes that were made there are bound to be some wins, and with the new tools in Go I can hopefully spot some other issues. However that is a subject for another blog post.
For the moment if you want the newer, more accurate scc you can build from source or get the binaries for Windows/Linux/macOS https://github.com/boyter/scc/