A work colleague (let’s call him Owen as that’s his name) asked me the other day
“I don’t understand the problem space scc et al solve. If you wanted to write a short post, I’d read and share the hell out of it. Basically, it seems like a heap of people can see the need for it, and I’m trying to understand it myself.”
Owen is one of the more switched-on people I know. As such, if he is asking about the point of tools such as scc, tokei, sloccount, cloc, loc and gocloc, then I suspect quite a few other people are asking the same thing.
To quote the hero text from a few of the tools mentioned:
scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go
cloc counts blank lines, comment lines, and physical lines of source code in many programming languages. Given two versions of a code base, cloc can compute differences in blank, comment, and source lines.
“SLOCCount” is a set of tools for counting physical Source Lines of Code (SLOC) in a large number of languages of a potentially large set of programs.
Tokei is a program that displays statistics about your code. Tokei will show number of files, total lines within those files and code, comments, and blanks grouped by language.
So what?
I am going to explain where I personally have used these tools. Others may have different experiences, but I suspect there will be a lot of overlap.
Here are some testimonials from SLOCCount:
“SLOCCount allows me to easily and quickly quantify the source lines of code and variety in languages. Even though these are just two fairly basic aspects of a project, it helps a lot to get a first impression of the size and complexity of projects.” – Auke Jilderda, Philips Research.
“SLOCCount has really helped us a lot in our studies on libre software engineering” – Jesus M. Gonzalez Barahona, Grupo de Sistemas y Comunicaciones, ESCET, Universidad Rey Juan Carlos.
“Thanks for SLOCCount! It’s great… We’re using SLOCs derived from SLOCCount to compare our software to the software it replaces … Keep up the good work” – Sam Tregar
“Wow, using sloccount on the full POPFile source shows that developing it would have cost around $500K in a regular software company. That seems about right given the length of time we’ve been working on it and the number of people involved. Cool tool.” – John Graham Cumming
From some Reddit threads https://www.reddit.com/r/rust/comments/82k9iy/loc_count_lines_of_code_quickly/ https://www.reddit.com/r/programming/comments/59bjoy/a_fast_cloc_replacement_written_in_rust/ https://www.reddit.com/r/rust/comments/3lnxht/tokei_a_cloccount_lines_of_code_tool_built_in_rust/ I found the following:
Been using loc for quite a while now and it’s pretty great. I love being able to update the team on how far along we are converting all java to kotlin.
I just usually use it to see how fast different parts of our codebases are growing. A few months ago in one of our projects we had 70k lines of kotlin, and now we’re at 90k.
It’s just a fun little tool. And yeah as someone pointed out below it shows you rogue languages in a project.
The above comments apply to all of the code counting tools: tokei, cloc, sloccount, gocloc, loc and scc.
However scc takes the idea a little further than the other tools by including a complexity estimate. Anyone who has worked with Visual Studio and .NET languages for a few years will have eventually discovered that one of the neat things you can do with it is produce cyclomatic complexity reports (https://en.wikipedia.org/wiki/Cyclomatic_complexity), down to counts per solution/project/namespace/file/class/method.
I always wanted something like that for all languages. While calculating true cyclomatic complexity requires building an AST for each language and processing the edges in it, I took a different approach. It is certainly not as accurate as the proper calculation, but it is considerably faster and in all my tests gives a reasonable estimate that should be in line with a proper cyclomatic complexity calculation at a per-file level.
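To make that concrete, the general idea can be sketched in a few lines of Go. This is not scc’s actual implementation, and the token list below is just an assumption for illustration, but it shows why counting branching keywords while scanning a file is so much cheaper than building an AST while still roughly tracking where the logic lives.

// complexity_sketch.go - a rough keyword-counting complexity estimate.
// This is a simplified illustration of the approach, not scc's real code;
// the token list and matching rules are assumptions for the example.
package main

import (
	"fmt"
	"os"
	"strings"
)

// branchTokens are constructs that introduce a branch in most C-like
// languages; each occurrence bumps the estimate by one.
var branchTokens = []string{"if ", "else ", "for ", "while ", "switch ", "case ", "&&", "||"}

func estimateComplexity(source string) int {
	count := 0
	for _, line := range strings.Split(source, "\n") {
		trimmed := strings.TrimSpace(line)
		// A real counter also needs to skip comments and string literals
		// while scanning; that bookkeeping is glossed over here.
		for _, tok := range branchTokens {
			count += strings.Count(trimmed, tok)
		}
	}
	return count
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: complexity_sketch <file>")
		os.Exit(1)
	}
	content, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("estimated complexity: %d\n", estimateComplexity(string(content)))
}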
What triggered me to do this, however, was working on an existing project I inherited. The code was in a bad state, but without a tool like scc I was unable to see how bad it really was. As such I underestimated how long it would take to manage and it ended up exploding in scope, which is something I don’t care to repeat.
To show how it all works I am going to briefly walk through analyzing a project that I know Owen is far more familiar with than myself, Kombustion https://github.com/KablamoOSS/kombustion which is an AWS CloudFormation tool on steroids. I am going to assume that the reader knows nothing about it beyond the name and what it does at this point.
To start, let’s just get a basic idea of what is in the current repository and its size. This example would work for any of the tools mentioned.
$ scc
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks Complexity
-------------------------------------------------------------------------------
Go 1782 333844 259667 39278 34899 41001
Markdown 45 5400 4090 0 1310 0
YAML 34 2021 1830 73 118 15
Assembly 25 903 680 0 223 25
JSON 15 319180 319180 0 0 0
Perl 8 1141 859 140 142 163
Shell 7 934 560 293 81 77
Protocol Buffers 6 912 693 1 218 0
TOML 3 127 39 70 18 0
Plain Text 3 227 190 0 37 0
Makefile 3 64 47 0 17 8
HTML 1 27 27 0 0 0
BASH 1 21 16 2 3 0
C 1 47 29 7 11 0
Dockerfile 1 33 6 17 10 0
-------------------------------------------------------------------------------
Total 1935 664881 587913 39881 37087 41289
-------------------------------------------------------------------------------
Estimated Cost to Develop $21,846,106
Estimated Schedule Effort 49.528242 months
Estimated People Required 52.248735
-------------------------------------------------------------------------------
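The cost, schedule and people figures at the bottom are COCOMO estimates, much like the ones SLOCCount produces. As a rough sketch (using the classic “organic” COCOMO parameters and SLOCCount’s default salary and overhead, which may differ slightly from the exact constants scc uses), the calculation from a raw SLOC count looks something like this:

// cocomo_sketch.go - a rough basic COCOMO estimate from a SLOC count.
// The constants are the classic "organic" project values plus SLOCCount's
// default salary and overhead; scc's exact parameters may differ slightly.
package main

import (
	"fmt"
	"math"
)

func main() {
	sloc := 587913.0 // total code lines from the scc run above

	kloc := sloc / 1000
	effort := 2.4 * math.Pow(kloc, 1.05)     // person-months of effort
	schedule := 2.5 * math.Pow(effort, 0.38) // calendar months
	people := effort / schedule              // average head count
	cost := effort / 12 * 56286 * 2.4        // person-years * salary * overhead

	fmt.Printf("effort %.0f person-months, schedule %.1f months, people %.1f, cost $%.0f\n",
		effort, schedule, people, cost)
}

Running that lands in the same ballpark as the numbers above, which should give a feel for how directly the estimate follows from the raw line count.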
What is apparent is that the vast majority of the application is written in Go. Go projects usually have a vendor directory containing all of the dependencies, and since these are third-party libraries we probably do not want to know too much about, let’s run scc again ignoring that directory. Again, any of the code counting tools should be able to do this.
$ scc --pbl vendor -co .
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks Complexity
-------------------------------------------------------------------------------
Go 1110 43605 34286 2568 6751 4152
Markdown 20 864 633 0 231 0
YAML 16 1717 1573 71 73 6
JSON 14 319087 319087 0 0 0
ASP.NET 4 9 9 0 0 0
HTML 1 27 27 0 0 0
Dockerfile 1 33 6 17 10 0
TOML 1 59 24 25 10 0
-------------------------------------------------------------------------------
Total 1167 365401 355645 2681 7075 4158
-------------------------------------------------------------------------------
What we can now see is that, compared to the previous run, the number of Go lines has dropped considerably from 333844 to 43605. This means there is a huge amount of code that this application depends on. We can also see that most of the languages have dropped off the list.
Perhaps most interesting at the moment is that there is a huge amount of JSON in the application. Let’s inspect just the JSON to see what it might be. The below will whitelist just JSON files and ignore all the complexity calculations, and with the --files flag we can also see each file individually. Again, any of the tools mentioned can do this.
$ scc --pbl vendor -wl json --files -c -co .
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks
-------------------------------------------------------------------------------
JSON 14 319087 319087 0 0
-------------------------------------------------------------------------------
~te/source/North Virginia.json 25602 25602 0 0
generate/source/Oregon.json 25602 25602 0 0
generate/source/Ireland.json 25602 25602 0 0
generate/source/Ohio.json 23748 23748 0 0
generate/source/Tokyo.json 23745 23745 0 0
generate/source/Sydney.json 23137 23137 0 0
~enerate/source/Frankfurt.json 22725 22725 0 0
generate/source/Seoul.json 22399 22399 0 0
~enerate/source/Singapore.json 21827 21827 0 0
generate/source/London.json 21607 21607 0 0
generate/source/Mumbai.json 21593 21593 0 0
~/source/North California.json 20672 20672 0 0
generate/source/Canada.json 20414 20414 0 0
~enerate/source/Sao Paulo.json 20414 20414 0 0
-------------------------------------------------------------------------------
Total 14 319087 319087 0 0
-------------------------------------------------------------------------------
Looks like these are generated and region-specific. I am going to make a guess at this point that they are checked-in AWS CloudFormation definitions.
$ head -n 10 generate/source/Sydney.json
{
"PropertyTypes": {
"AWS::ElasticLoadBalancingV2::ListenerCertificate.Certificate": {
"Documentation": "http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticloadbalancingv2-listener-certificates.html",
"Properties": {
"CertificateArn": {
"Documentation": "http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticloadbalancingv2-listener-certificates.html#cfn-elasticloadbalancingv2-listener-certificates-certificatearn",
"PrimitiveType": "String",
"Required": false,
"UpdateType": "Mutable"
Looks like the guess was right. Let’s continue to explore, but this time let’s ignore JSON and focus on Go, which is the meat of the application.
This is where scc is most useful, as we can sort by complexity to find which files are likely to contain the most logic. The below will whitelist Go files sorted by complexity, ignoring the vendor directory.
$ scc --pbl vendor -wl go --files -s complexity .
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks Complexity
-------------------------------------------------------------------------------
Go 1110 43605 34286 2568 6751 4152
-------------------------------------------------------------------------------
~ternal/genplugin/genplugin.go 398 335 3 60 83
generate/generate.go 681 573 22 86 81
~al/cloudformation/template.go 277 216 19 42 54
internal/plugins/load.go 211 157 22 32 51
internal/plugins/install.go 284 231 15 38 50
internal/plugins/add.go 220 163 23 34 36
~loudformation/tasks/upsert.go 128 109 6 13 26
From the above we can deduce that there are three files and one group of files that are reasonably complex and worth a look if we wanted to work with this code-base: genplugin.go, generate.go, template.go and the files in internal/plugins. We can also make a guess that there are few unit tests for any of the above, as I would expect complex test files to appear next to them. They may still be covered by integration tests, though perhaps written in another language.
Let’s compare the above to tokei.
$ tokei --files -e vendor -s lines .
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks
-------------------------------------------------------------------------------
Go 1110 43626 34292 2576 6758
-------------------------------------------------------------------------------
./generate/generate.go 689 574 23 92
|l/genplugin/genplugin.go 399 336 3 60
|ernal/plugins/install.go 284 231 15 38
./pkg/parsers/output.go 281 275 3 3
|pkg/parsers/resources.go 281 275 3 3
|oudformation/template.go 277 216 19 42
|anifest/manifest_test.go 259 168 81 10
./internal/plugins/add.go 220 163 23 34
|internal/plugins/load.go 211 157 22 32
./main.go 176 101 57 18
Ignoring the differences in counts (all of the tools come up with slightly different numbers), you can see that the two tools give a different view of which files are potentially the most complex: tokei here is sorting by lines, while scc sorted by its complexity estimate.
Looking at genplugin.go, identified by scc as the most complex file, two portions caught my eye.
func getVal(v interface{}, depth int, append string) string {
typ := reflect.TypeOf(v).Kind()
if typ == reflect.Int {
return fmt.Sprint("\"", v, "\"", append)
} else if typ == reflect.Bool {
return fmt.Sprint("\"", v, "\"", append)
} else if typ == reflect.String {
return fmt.Sprint("\"", strings.Replace(strings.Replace(v.(string), "\n", "\\n", -1), "\"", "\\\"", -1), "\"", append)
} else if typ == reflect.Slice {
return printSlice(v.([]interface{}), depth+1, append)
} else if typ == reflect.Map {
return printMap(v.(map[interface{}]interface{}), depth+1, append)
}
return "UNKNOWN_TYPE_" + typ.String()
}
and
if len(config.Parameters) > 0 {
writeLine(buf, "// Parse the config data\n"+
"var config "+resname+"Config\n"+
"if err = yaml.Unmarshal([]byte(data), &config); err != nil {\n"+
" return\n"+
"}\n"+
"\n"+
"// validate the config\n"+
"config.Validate()\n\n// defaults\n")
for paramName, param := range config.Parameters {
if strings.HasPrefix(param.Type, "List<") || param.Type == "CommaDelimitedList" {
writeLine(buf, "param"+paramName+" := []interface{}{}\n"+
"if len(config.Properties."+paramName+") > 0 {\n"+
" param"+paramName+" = config.Properties."+paramName+"\n"+
"}\n\n")
} else {
defaultVal := ""
if param.Default != nil {
defaultVal = fmt.Sprintf("%v", param.Default) // ensure it's a string if int is given
}
writeLine(buf, "param"+paramName+" := \""+defaultVal+"\"\n"+
"if config.Properties."+paramName+" != nil {\n"+
" param"+paramName+" = *config.Properties."+paramName+"\n"+
"}\n\n")
}
}
}
I’d say both of those are portions of code I would want to take great care with were I to modify them. Note that I have no idea what they are doing, so I am not commenting on the code quality here; I am just exploring a code-base I have never looked at before. It’s entirely possible this is the simplest way to solve this problem.
What’s nice is that you can use all of the above to get an idea of the complex files in any project. For example, the below is an analysis of the source code for searchcodeserver.com.
$ scc --wl java --files -s complexity searchcode-server
-------------------------------------------------------------------------------
Language Files Lines Code Comments Blanks Complexity
-------------------------------------------------------------------------------
Java 131 19445 13913 1716 3816 1107
-------------------------------------------------------------------------------
~e/app/util/SearchCodeLib.java 616 418 90 108 108
~app/service/IndexService.java 1097 798 91 208 96
~/app/service/CodeMatcher.java 325 234 41 50 66
~service/TimeCodeSearcher.java 582 429 49 104 65
~ce/route/ApiRouteService.java 394 293 12 89 63
~de/app/service/Singleton.java 335 245 20 70 53
~rchcode/app/util/Helpers.java 396 299 38 59 52
~e/route/CodeRouteService.java 453 348 9 96 50
Hope that helps Owen with his understanding, and whoever else happens to read this article. If you find scc useful please let me know either through email, Twitter or just a nice comment on GitHub.