Sloc Cloc and Code - ULOC Unique Lines of Code
2024/05/01 (949 words)

I recently pushed out a new release of Sloc Cloc and Code (scc) with two main pieces of functionality. The first is accurate .gitignore support. The second, and the first new feature to hit the codebase in a long time, is a new metric: Unique Lines of Code, or ULOC.

A few years ago exwhyz (who appears to have dropped off the internet) posted a feature request asking for this new metric to be included in scc's output.

They also helpfully included the following links: https://cmcenroe.me/2018/12/14/uloc.html and the lobste.rs discussion of the idea.

To save you a click, I have included the most relevant quote:

In my opinion, the number this produces should be a better estimate of the complexity of a project. Compared to SLOC, not only are blank lines discounted, but so are close-brace lines and other repetitive code such as common includes. On the other hand, ULOC counts comments, which require just as much maintenance as the code around them does, while avoiding inflating the result with license headers which appear in every file, for example.

At the time I was busy doing other things, and I had been wanting to fix .gitignore support in scc by moving it to use gocodewalker. With that recently done I remembered this feature and took another look at it.

What really helped was that the link contained an implementation, which makes implementing it much easier:

sort -u *.h *.c | wc -l

I have always been pretty open to adding new metrics to scc, such as the complexity estimate and the COCOMO calculations, and frankly this sounded like it could be useful, especially with the suggestion by minimax on lobste.rs of a DRYness calculation, where DRYness = ULOC / SLOC. Since scc already has the SLOC count, adding a DRYness calculation to the app is rather trivial.

Looking at the supplied calculation, it of course has a few problems: it does not recurse into directories, it does not respect .gitignore files, and it groups languages together, which may not be ideal. We can overcome all of that by building it into scc, which is what I did.
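Conceptually, what scc needs to do is not far from that one liner, just with proper file walking and a set of lines per language. Below is a minimal Go sketch of the idea, assuming a hardcoded extension map and the standard library walker rather than gocodewalker, so it is not how scc actually implements it, just the shape of the calculation.

package main

import (
	"bufio"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Hypothetical extension to language map for this sketch; the real scc has a
	// full language database and walks files via gocodewalker so .gitignore is respected.
	langs := map[string]string{".go": "Go", ".java": "Java"}

	perLang := map[string]map[string]struct{}{} // unique lines per language
	all := map[string]struct{}{}                // unique lines across every file
	totalLines := 0                             // plain line count, used as the DRYness denominator here

	_ = filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		lang, ok := langs[strings.ToLower(filepath.Ext(path))]
		if !ok {
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return nil // skip unreadable files in this sketch
		}
		defer f.Close()

		if perLang[lang] == nil {
			perLang[lang] = map[string]struct{}{}
		}
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			line := scanner.Text()
			perLang[lang][line] = struct{}{} // set semantics give the same result as sort -u
			all[line] = struct{}{}
			totalLines++
		}
		return nil
	})

	for lang, lines := range perLang {
		fmt.Printf("%s ULOC: %d\n", lang, len(lines))
	}
	fmt.Printf("Total ULOC: %d\n", len(all))
	if totalLines > 0 {
		// DRYness = ULOC / SLOC, per the lobste.rs suggestion, using total lines here.
		fmt.Printf("DRYness: %.2f\n", float64(len(all))/float64(totalLines))
	}
}

A map used as a set gives the same deduplication as sort -u without needing the sort.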

$ scc -i go,java -a --no-cocomo
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Go                          30      9335     1458       453     7424       1516
(ULOC)                              3930
-------------------------------------------------------------------------------
Java                        24      3913      798       651     2464        547
(ULOC)                               102
───────────────────────────────────────────────────────────────────────────────
Total                       54     13248     2256      1104     9888       2063
───────────────────────────────────────────────────────────────────────────────
Unique Lines of Code (ULOC)         4026
DRYness %                           0.30
───────────────────────────────────────────────────────────────────────────────
Processed 524736 bytes, 0.525 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

scc now has two additional flags for this, -a --dryness and -u --uloc, with the former implying the latter. Adding a new metric to what was already a very condensed display took a while, but we can now see the ULOC value per language, and of course the total. NB I am not sold on this output yet, and am happy to change it if someone comes up with something better.
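For reference, the short and long forms:

$ scc -u     # --uloc, adds the per language (ULOC) rows and the total
$ scc -a     # --dryness, implies --uloc and also prints the DRYness %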

Now what does this tell us? I know that the Java included here consists largely of redundant copy-pasted files used in scc to test its duplication detection. This is reflected in the ULOC calculation for Java, where only 102 of the 3913 lines are unique.
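Running the same ratio per language by hand gives 102 / 3913 ≈ 0.03, which is about as far from DRY as you can get.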

Restricting to just Go gives the following output, with the corresponding DRYness % increasing as expected.

$ scc -i go -a --no-cocomo     
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Go                          30      9335     1458       453     7424       1516
(ULOC)                              3930
───────────────────────────────────────────────────────────────────────────────
Total                       30      9335     1458       453     7424       1516
───────────────────────────────────────────────────────────────────────────────
Unique Lines of Code (ULOC)         3930
DRYness %                           0.42
───────────────────────────────────────────────────────────────────────────────
Processed 395673 bytes, 0.396 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
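As a quick sanity check, the DRYness value above appears to be the ULOC divided by the total line count: 3930 / 9335 ≈ 0.42.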

Is the DRYness % there a good value? I had no idea, so I tried it against the C portion of the Redis fork Valkey, since it is/was well known to be a pretty clean codebase.

$ scc -a -i c --no-cocomo valkey
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                          423    244716    27969     42431   174316      42926
(ULOC)                            135502
───────────────────────────────────────────────────────────────────────────────
Total                      423    244716    27969     42431   174316      42926
───────────────────────────────────────────────────────────────────────────────
Unique Lines of Code (ULOC)       135502
DRYness %                           0.55
───────────────────────────────────────────────────────────────────────────────
Processed 8553843 bytes, 8.554 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

As expected, the higher the value the “cleaner” the codebase, with a value of 0.55 being what I am guessing is a very good result. The closer you get to 1 the more DRY your code is, with a score around 0.5 presumably counting as “good”.

Alas, it's not a free lunch, with the new metric eating into the runtime of scc… by a lot.

$ hyperfine 'scc -i c valkey' 'scc -i c -a valkey'
Benchmark 1: scc -i c valkey
  Time (mean ± σ):      48.8 ms ±   0.5 ms    [User: 94.5 ms, System: 38.5 ms]
  Range (min … max):    48.0 ms …  50.6 ms    56 runs
 
Benchmark 2: scc -i c -a valkey
  Time (mean ± σ):      89.3 ms ±   2.1 ms    [User: 154.5 ms, System: 45.8 ms]
  Range (min … max):    86.9 ms …  95.2 ms    32 runs
 
Summary
  scc -i c valkey ran
    1.83 ± 0.05 times faster than scc -i c -a valkey

Thankfully, however, I have paid so much attention to performance in scc over the years that, while there is a cost, for a metric you only want every now and again it is not too painful to pay.

Anyway, is this useful? I have no idea. I hope exwhyz comes back and reports on it in the GitHub issue at some point. If you are using scc and find this useful, please let me know, as I am still learning what it means myself.