Ben E. C. Boyter's Blog

Boilerplate Tax - Ranking popular programming languages by density
2026/02/03 (1797 words)

I was looking through Google Scholar the other day for references to scc, since people had previously reached out to me about using it for their thesis (it's not narcissism, it's research I swear!). I was curious to see how it was being used. One thing I noticed was that researchers were using it to measure the size of projects. A valid use case IMHO. However, I was surprised by the lack of use of some of scc's more interesting features.

A while ago I wrote about a new code measurement in scc called ULOC (Unique Lines of Code). In the interests of respecting your time here is the relevant quote from it,

In my opinion, the number this produces should be a better estimate of the complexity of a project. Compared to SLOC, not only are blank lines discounted, but so are close-brace lines and other repetitive code such as common includes. On the other hand, ULOC counts comments, which require just as much maintenance as the code around them does, while avoiding inflating the result with license headers which appear in every file, for example.

This measure has been in scc for about two years now, and yet it has not been used once. Which probably makes sense, as I finished that post with,

Anyway is this useful? I have no idea.

What I really needed was to calculate ULOC over a lot of repositories. I could then objectively say what a normal value is. For example, we all know Go has more boilerplate repetition than Rust, but by how much?

I didn’t want to run it over millions of repositories, but how about the top 1,000 or so? A brief search turned up the GitHub repository https://github.com/EvanLi/Github-Ranking/ which contains the top 100 repositories for different languages! Perfect. I cloned it down and it looked like exactly what I needed for this test.

Turns out scc is not in the top 100 most starred GitHub repositories… sadge…

Even better, since it has the repositories grouped by language, I can narrow down to just that language and avoid noise from other languages in each repository. Note that this has some issues: for C/C++ it means I would not be looking at the headers, but since headers with a lot of logic are likely header-only libraries, that seemed acceptable.

In the interests of time, I turned to Google Gemini to generate a simple Python script to do the work for me. Since I knew exactly what I wanted, I could spec out a close-to-one-shot prompt of what I needed.

I need a Python script.

It is going to do the following. Accept an argument to a directory. Look through that directory for .md markdown files. It needs to extract the name of the file minus the extension, so `java.md` would become `java`. It then opens each file and looks through each line to find github repositories, the format looks like this,

| 1 | [JavaGuide](https://github.com/Snailclimb/JavaGuide) | 153711 | 46107 | Java | 74 | Java 面试 & 后端通用面试指南,覆盖计算机基础、数据库、分布式、高并发与系统设计。准备后端技术面试,首选 JavaGuide! | 2026-01-31T15:55:16Z |
| 2 | [hello-algo](https://github.com/krahets/hello-algo) | 122078 | 14820 | Java | 6 | 《Hello 算法》:动画图解、一键运行的数据结构与算法教程。支持简中、繁中、English、日本語,提供 Python, Java, C++, C, C#, JS, Go, Swift, Rust, Ruby, Kotlin, TS, Dart 等代码实现 | 2026-01-23T10:40:44Z |

It extracts each https link to the repository, and adds .git on the end and saves it to a list.

Once done we then iterate the list doing a few things.

1. Run a shell command out to git to clone the repository using a shallow clone. It should happen in the /tmp/ directory.
2. Run the scc tool (it's installed globally) with arguments `scc -u -a --format sql | sqlite3 ../code.db`. On the first run it should use `sql`, then after that `sql-insert`. We should change into the cloned directory first so scc runs at its root, but redirect the output one level up (`../`) so the database remains after cleanup.
3. Clean up the directory cloned by deleting it

It should then exit the program.
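
For the curious, here is roughly the shape of the script I was asking for. This is a minimal sketch rather than the actual Gemini output: the /tmp/scc_run working directory and the code.db filename are my own choices, error handling is deliberately thin, and the per-language filename handling from the prompt is omitted for brevity.

```python
# Minimal sketch of the clone-and-measure loop (not the actual Gemini output).
# Assumes git, scc and sqlite3 are on the PATH; /tmp/scc_run is an arbitrary choice.
import re
import shutil
import subprocess
import sys
from pathlib import Path

GITHUB_LINK = re.compile(r"\((https://github\.com/[^)]+)\)")

def collect_repos(directory: Path) -> list[str]:
    """Pull every github.com link out of the ranking markdown files."""
    repos = []
    for md in sorted(directory.glob("*.md")):
        for line in md.read_text(encoding="utf-8", errors="replace").splitlines():
            repos.extend(url + ".git" for url in GITHUB_LINK.findall(line))
    return repos

def main() -> None:
    repos = collect_repos(Path(sys.argv[1]))
    work = Path("/tmp/scc_run")
    work.mkdir(exist_ok=True)
    for i, repo in enumerate(repos):
        target = work / f"repo_{i}"
        subprocess.run(["git", "clone", "--depth", "1", repo, str(target)], check=False)
        # First repository creates the schema (--format sql); the rest append
        # rows (--format sql-insert). Output goes one level up so the database
        # survives the cleanup below.
        fmt = "sql" if i == 0 else "sql-insert"
        subprocess.run(f"scc -u -a --format {fmt} | sqlite3 ../code.db",
                       shell=True, cwd=target, check=False)
        shutil.rmtree(target, ignore_errors=True)

if __name__ == "__main__":
    main()
```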

With that done I had a script and could get to work. I ran it on my desktop and promptly went to bed. The next morning I had a SQLite database with every calculation. At least I thought I did. It uncovered a bug in scc, which I have since resolved. So I queued up the run again and waited for the result: a 472 MB SQLite file.

So a few queries to know what we are working with.

sqlite> select count(*) from t;
2703656
sqlite> select sum(nCode) from t;
410529727
sqlite> select sum(nBlank) from t;
64972188
sqlite> select sum(nComment) from t;
65967492
sqlite> select count(*) from metadata;
3418

400 million lines of code ought to be enough for any sort of comparison!

My first query was to find out the uniqueness percentage across all languages. Uniqueness here means: take all the lines in a file, throw away every duplicate, and what remains is the unique line count; divide that by the total number of lines to get the percentage.
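
Concretely, for a single file the idea is roughly this (a sketch of the definition, not scc's actual implementation):

```python
def uniqueness_percentage(path: str) -> float:
    """Unique lines as a percentage of all physical lines in one file.

    Rough sketch of the ULOC idea: duplicate lines (blank lines, lone closing
    braces, repeated imports, license headers...) collapse to a single entry.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()
    if not lines:
        return 0.0
    return len(set(lines)) / len(lines) * 100
```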

sqlite> SELECT
    Language,
    SUM(nCode + nComment + nBlank) as Total_Physical_Lines,
    SUM(nUloc) as Total_Unique_Lines,
    ROUND(CAST(SUM(nUloc) AS FLOAT) / SUM(nCode + nComment + nBlank) * 100, 2) as Uniqueness_Percentage
FROM t
WHERE (nCode + nComment + nBlank) > 0
GROUP BY Language
ORDER BY Uniqueness_Percentage DESC;

Shell|1041730|796519|76.46
Clojure|2206584|1670455|75.7
MATLAB|2989986|2164924|72.41
Vim Script|3778425|2724477|72.11
DM|53682824|37744972|70.31
Haskell|6286404|4283395|68.14
Perl|5953028|4038887|67.85
CoffeeScript|548420|371519|67.74
TeX|688517|455517|66.16
Kotlin|10001887|6489270|64.88
Scala|11267790|7281749|64.62
R|1975720|1271846|64.37
LaTeX|559617|358979|64.15
Java|55428466|34815209|62.81
PHP|18955944|11853895|62.53
Julia|2889492|1802993|62.4
Python|20974088|12963942|61.81
Ruby|13991320|8494290|60.71
TypeScript|27889057|16737261|60.01
Objective C|2567016|1534656|59.78
C++|62955292|37198201|59.09
ActionScript|6975719|4114080|58.98
C|64288288|37008874|57.57
Swift|4493217|2581337|57.45
JavaScript|15048799|8626131|57.32
Groovy|3587705|2044644|56.99
Rust|20738501|11777964|56.79
Go|46418301|25911541|55.82
Elixir|3870450|2132232|55.09
Dart|14691026|7846238|53.41
Powershell|2494712|1315880|52.75
C#|40139432|20418679|50.87
HTML|2597758|1238299|47.67
CSS|2160131|977208|45.24
Lua|7333761|2874755|39.2

Interesting. This suggests what we probably knew already: things like shell scripts are generally bespoke, hence having the most uniqueness. However, I suspect there are outliers ruining the results here. There is no good reason for Lua to be at the bottom of the list. Let's try again, averaging per project this time,

sqlite> SELECT
    Language,
    COUNT(DISTINCT Project) as Repo_Count,
    ROUND(AVG(Project_Dryness), 2) as Dryness
FROM (
    SELECT
        Project,
        Language,
        -- DRYness = ULOC / (Code + Comment + Blank)
        (CAST(SUM(nUloc) AS FLOAT) / NULLIF(SUM(nCode + nComment + nBlank), 0) * 100) as Project_Dryness
    FROM t
    GROUP BY Project, Language
)
WHERE Project_Dryness IS NOT NULL
GROUP BY Language
ORDER BY Dryness DESC;

Note that the dryness ratio is now calculated over the files of each repository and then averaged across repositories, rather than pooled over the whole corpus. This has the benefit that a handful of enormous repositories can no longer drag an entire language's number towards their own.
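
To see why averaging per repository matters, here is a toy illustration with made-up numbers: one huge low-dryness repository drags the pooled ratio down to its own level, while the per-repository average treats every project equally.

```python
# Toy illustration (made-up numbers): pooled ratio vs per-repository average.
repos = [
    {"name": "tiny-lib", "uloc": 800, "total": 1_000},               # 80% dry
    {"name": "small-tool", "uloc": 1_500, "total": 2_000},           # 75% dry
    {"name": "huge-monorepo", "uloc": 400_000, "total": 1_000_000},  # 40% dry
]

# Pooled: sum everything first, divide once. The monorepo dominates.
pooled = sum(r["uloc"] for r in repos) / sum(r["total"] for r in repos) * 100

# Averaged: compute each repository's dryness, then take the mean.
averaged = sum(r["uloc"] / r["total"] for r in repos) / len(repos) * 100

print(f"pooled:   {pooled:.1f}%")   # ~40.1%
print(f"averaged: {averaged:.1f}%") # ~65.0%
```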

| Language | Repo Count | Dryness |
|---|---|---|
| Clojure | 100 | 77.91% |
| Haskell | 99 | 77.25% |
| MATLAB | 105 | 75.72% |
| DM | 99 | 74.41% |
| Shell | 82 | 72.24% |
| TeX | 62 | 71.81% |
| Vim Script | 95 | 70.4% |
| CoffeeScript | 99 | 70.05% |
| R | 100 | 68.11% |
| Python | 100 | 67.78% |
| Lua | 99 | 67.77% |
| Kotlin | 100 | 67.72% |
| Julia | 100 | 67.4% |
| LaTeX | 38 | 67.23% |
| Perl | 86 | 67.03% |
| Scala | 99 | 66.1% |
| Dart | 99 | 65.79% |
| ActionScript | 98 | 65.74% |
| Java | 100 | 65.72% |
| Groovy | 93 | 64.58% |
| JavaScript | 98 | 64.52% |
| Powershell | 101 | 63.73% |
| TypeScript | 100 | 63.34% |
| Swift | 100 | 63.28% |
| Objective C | 120 | 63.15% |
| HTML | 99 | 62.76% |
| Ruby | 95 | 62.73% |
| PHP | 100 | 62.05% |
| Rust | 100 | 60.5% |
| C | 98 | 59.71% |
| C++ | 100 | 59.33% |
| Elixir | 100 | 58.97% |
| Go | 99 | 58.78% |
| C# | 99 | 58.4% |
| CSS | 99 | 58.24% |
Much better. This actually fits with what I would expect: Lisp-style languages at the top, with scripts and other non-reusable things just below. The same goes for the languages at the bottom, which as you would expect have more repetition.

CoffeeScript is an interesting case. It was designed to be terse, and it beats out TypeScript when it comes to being concise. You could use this to argue that modern tooling has increased the redundancy of codebases compared to the terse era of old, as even JavaScript has more unique code than TypeScript.

What strikes me right away is that Java is a lot more DRY than I would have initially guessed! What is really interesting is that this “proves” that modern doesn’t always mean cleaner or more DRY. The other JVM languages are interesting too. Java is not as bad as you would first think compared to Groovy, Scala and Kotlin (let’s ignore Clojure here as it’s a different programming paradigm). Groovy having less uniqueness probably makes sense given that it’s used in DevOps scripting a fair bit, and if you have ever used CloudFormation or Terraform you are aware of how much repetition goes on there. What we can say though is that it looks like Kotlin didn’t just make Java prettier, it made it more dense.

Clojure, and indeed all Lisp-style languages, are the final boss of density here though. Almost every line is an expression of business logic. I never really got into it myself, but looking at these results is starting to make me rethink that decision.

If you compare Clojure (77.91%) to C# (58.4%), it seems the average C# developer writes roughly 20 percentage points more redundant code every single day just to satisfy the compiler. Even with tools like ReSharper and LLMs to help, that’s not an insignificant amount of effort.

I previously claimed that we all know Go has more boilerplate repetition than Rust (I write blog posts as I go, BTW, and don’t revise statements like that as results come in), but as it turns out they are almost identical in their redundancy footprint. The data suggests they are much closer than you would think.

The Go Zealot: “Go is simple!” “Yes, so simple you write the same thing over and over.”

The Rust Zealot: “Rust is perfectly expressive!” “Yes, but you spend 40% of your time in setup code and trait impls.”

So there it is: the dryness of each language, in order. A baseline I can add to the scc repository as an example of what is considered average across very popular repositories.

Interpreting Dryness,

- 75%+ (High Density): Very terse, expressive code. Every line counts. (Examples: Clojure, Haskell)
- 60% - 70% (Standard): A healthy balance of logic and structural ceremony. (Examples: Java, Python)
- < 55% (High Boilerplate): High repetition, likely due to mandatory error handling, auto-generated code, or verbose configuration. (Examples: C#, CSS)
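
If you want to see where your own languages land, here is a minimal sketch that applies these bands to the database produced by the pipeline above. It assumes the `t` table and column names shown in the earlier queries; the `code.db` filename is my own choice, not something fixed by scc.

```python
# Minimal sketch: bucket each language's average per-repository dryness into
# the bands above. Assumes the `t` table/columns from the queries in this post
# and a database file called code.db (adjust the path to your own run).
import sqlite3

QUERY = """
SELECT Language, AVG(dryness) AS dryness
FROM (
    SELECT Project, Language,
           CAST(SUM(nUloc) AS FLOAT)
               / NULLIF(SUM(nCode + nComment + nBlank), 0) * 100 AS dryness
    FROM t
    GROUP BY Project, Language
)
WHERE dryness IS NOT NULL
GROUP BY Language
ORDER BY dryness DESC;
"""

def band(dryness: float) -> str:
    # The bands above leave 55-60% and 70-75% unlabelled; lump them with the
    # nearest band here to keep the sketch simple.
    if dryness >= 75:
        return "High Density"
    if dryness >= 55:
        return "Standard"
    return "High Boilerplate"

with sqlite3.connect("code.db") as conn:
    for language, dryness in conn.execute(QUERY):
        print(f"{language:<15} {dryness:5.1f}%  {band(dryness)}")
```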

We spent decades building modern languages like Go, Rust, and TypeScript to solve some of the language mistakes of old. In some cases this has been a huge success. But according to the data, if you want the highest ratio of human thought to keystrokes, the winner is a 60-year-old concept: Lisp, running as the modern JVM language Clojure.

But perhaps the main takeaway is that in a lot of languages, progress has come with more noise relative to signal.

Food for thought there. Then again, perhaps LLMs make this all moot now anyway.

Want to do this yourself? Feel free to improve on the results! I’ll link back to you, just contact me.