Processing 40 TB of code from ~10 million projects with a dedicated server and Go for $100

Intro?

I blog at boyter.org
I write free software: github.com/boyter/
I run searchcode.com
On the twitter: @boyter
ActivityPub: @boyter@honk.boyter.org

Largest dataset: ~6PB
Largest table: 2+ trillion rows
Highest QPS: 70,000/s under a DDoS

Outline

What did I try?
What did I learn?
What worked?
The future?
Findings...

Why?

Why would anyone in their right mind do this?

scc

```
$ scc redis
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                          296    180267    20367     31679   128221      32548
C Header                   215     32362     3624      6968    21770       1636
TCL                        143     28959     3130      1784    24045       2340
Shell                       44      1658      222       326     1110        187
Autoconf                    22     10871     1038      1326     8507        953
Lua                         20       525       68        70      387         65
Markdown                    16      2595      683         0     1912          0
Makefile                    11      1363      262       125      976         59
Ruby                        10       795       78        78      639        116
gitignore                   10       162       16         0      146          0
YAML                         6       711       46         8      657          0
HTML                         5      9658     2928        12     6718          0
C++                          4       286       48        14      224         31
License                      4       100       20         0       80          0
Plain Text                   3       185       26         0      159          0
CMake                        2       214       43         3      168          4
CSS                          2       107       16         0       91          0
Python                       2       219       12         6      201         34
Systemd                      2        80        6         0       74          0
BASH                         1       118       14         5       99         31
Batch                        1        28        2         0       26          3
C++ Header                   1         9        1         3        5          0
Extensible Styleshe…         1        10        0         0       10          0
Smarty Template              1        44        1         0       43          5
m4                           1       562      116        53      393          0
───────────────────────────────────────────────────────────────────────────────
Total                      823    271888    32767     42460   196661      38012
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $6,918,301
Estimated Schedule Effort (organic) 28.682292 months
Estimated People Required (organic) 21.428982
───────────────────────────────────────────────────────────────────────────────
Processed 9425137 bytes, 9.425 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
```

Attempt 1

Most common filenames?

makefile    59,141,098
index       33,962,093
readme      22,964,539
jquery      20,015,171
main        12,308,009
package     10,975,828
license     10,441,647
__init__    10,193,245

Attempt 2


How many “pure” projects?

Attempt 3


YAML or YML?

yaml     3,572,609
yml     14,076,349

Attempt 4



Why Go?


Channels and Pipes

```
cat urllist.txt | xargs -P16 python parse.py
```

```
ch := make(chan string)
for i := 0; i < 16; i++ {
	go parse(ch)
}
for _, l := range urllist {
	ch <- l
}
```
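The snippet above is the slide-sized version. A minimal runnable sketch of the same pattern, with parse and urllist stubbed out as placeholders and a WaitGroup plus close added so the program terminates cleanly:

```
package main

import (
	"fmt"
	"sync"
)

// parse stands in for whatever per-URL work is done (cloning, running scc, etc.).
func parse(ch <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for url := range ch {
		fmt.Println("processing", url)
	}
}

func main() {
	urllist := []string{
		"https://github.com/boyter/scc",
		"https://github.com/redis/redis",
	}

	ch := make(chan string)
	var wg sync.WaitGroup

	// Start 16 workers, mirroring xargs -P16.
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go parse(ch, &wg)
	}

	// Feed work down the channel, then close it so the workers exit.
	for _, l := range urllist {
		ch <- l
	}
	close(ch)
	wg.Wait()
}
```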

Files in a repo 95%

Processing...

Database?
Spinning rust...

Go again...

```
filesPerProject := map[int64]int64{}                     // Number of files in each project in buckets, i.e. projects with 10 files or projects with 2
projectsPerLanguage := map[string]int64{}                // Number of projects which use a language
filesPerLanguage := map[string]int64{}                   // Number of files per language
hasLicenceCount := map[string]int64{}                    // Count of if a project has a licence file or not
fileNamesCount := map[string]int64{}                     // Count of filenames
fileNamesNoExtensionCount := map[string]int64{}          // Count of filenames without extensions
fileNamesNoExtensionLowercaseCount := map[string]int64{} // Count of filenames tolower and no extensions
complexityPerLanguage := map[string]int64{}              // Sum of complexity per language
commentsPerLanguage := map[string]int64{}                // Sum of comments per language
sourceCount := map[string]int64{}                        // Count of each source github/bitbucket/gitlab
ymlOrYaml := map[string]int64{}                          // yaml or yml extension?
mostComplex := Largest{}                                 // Holds details of the most complex file
mostComplexPerLanguage := map[string]Largest{}           // Most complex of each file type
mostComplexWeighted := Largest{}                         // Most complex file weighted by lines, NB useless because it only picks up minified files
mostComplexWeightedPerLanguage := map[string]Largest{}   // Most complex of each file type weighted by lines
largest := Largest{}                                     // Holds details of the largest file in bytes
largestPerLanguage := map[string]Largest{}               // Largest file per language
longest := Largest{}                                     // Holds details of the longest file in lines
longestPerLanguage := map[string]Largest{}               // Longest file per language
mostCommented := Largest{}                               // Holds details of the most commented file in lines
mostCommentedPerLanguage := map[string]Largest{}         // Most commented file per language
```
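The Largest type is not shown on the slide; a guess at its minimal shape, based purely on how the variables above use it (tracking a single record-holding file), might be:

```
// Largest is not shown in the talk; this is a guess at its shape based on
// how the variables above are used: remembering the one "winning" file for
// a given metric along with enough detail to report it later.
type Largest struct {
	Name       string // filename or path of the current record holder
	Location   string // which repository it came from
	Lines      int64
	Code       int64
	Comment    int64
	Blank      int64
	Complexity int64
	Bytes      int64
}
```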

The Java FactoryFactory

not factory             271,375,574   97.9%
factory                   5,695,568   2.09%
factoryfactory               25,316   0.009%
factoryfactoryfactory             0   :(

Raw Numbers

9,985,051 total repositories
9,100,083 repositories with at least 1 identified file
884,968 empty repositories (those with no files)
3,529,516,251 files in all repositories
40,736,530,379,778 bytes processed (40 TB)
1,086,723,618,560 lines identified
816,822,273,469 code lines identified
124,382,152,510 blank lines identified
145,519,192,581 comment lines identified
71,884,867,919 complexity count according to scc rules

Potty Mouth

Language     Curse count   %
C Header           7,660   0.001%
Java               7,023   0.002%
C                  6,897   0.001%
PHP                5,713   0.002%
JavaScript         4,306   0.001%
Dart               1,533   0.191%

Lessons Learnt

Don't store lots of files in /tmp
Don't use S3 at first...
Consider compression, such as zstd.
Keep results locally!
When CPU usage is high for a long time, consider a dedicated server
Use --depth=1 for shallow git clones (see the sketch after this list)
Go works well as glue code
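Not the actual pipeline code, but a minimal sketch of those last two points together: driving a shallow git clone from Go as glue code (the repository URL and destination path here are placeholders).

```
package main

import (
	"fmt"
	"os/exec"
)

// shallowClone fetches only the latest commit, which is all a line counter
// needs, and keeps clone times and disk usage down.
func shallowClone(url, dest string) error {
	cmd := exec.Command("git", "clone", "--depth=1", url, dest)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("clone %s failed: %v: %s", url, err, out)
	}
	return nil
}

func main() {
	if err := shallowClone("https://github.com/boyter/scc", "./clones/scc"); err != nil {
		fmt.Println(err)
	}
}
```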

Dedicated server go brrr

Missing a license

The Future?

Keeping the URL properly...
Forget S3
Trie to store filenames (avoid lossy counting; see the sketch after this list)
Use SQLite?
Latest version of scc
Determine maintainability
Tabs vs Spaces
Don't store as JSON
Try bigquery?
Another language? Rust? Zig?
Explorer!
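For the trie idea above, a rough sketch of what storing filename counts in a trie could look like (hypothetical code, not part of the existing pipeline):

```
package main

import "fmt"

// trieNode stores filename counts character by character, so no name is
// thrown away the way a capped map of counts would be.
type trieNode struct {
	children map[rune]*trieNode
	count    int64 // number of files whose name ends exactly here
}

func newTrieNode() *trieNode {
	return &trieNode{children: map[rune]*trieNode{}}
}

// insert walks the filename, creating nodes as needed, and bumps the count
// at the final node.
func (t *trieNode) insert(name string) {
	node := t
	for _, r := range name {
		child, ok := node.children[r]
		if !ok {
			child = newTrieNode()
			node.children[r] = child
		}
		node = child
	}
	node.count++
}

// lookup returns how many times a filename was inserted.
func (t *trieNode) lookup(name string) int64 {
	node := t
	for _, r := range name {
		child, ok := node.children[r]
		if !ok {
			return 0
		}
		node = child
	}
	return node.count
}

func main() {
	root := newTrieNode()
	for _, name := range []string{"makefile", "makefile", "readme"} {
		root.insert(name)
	}
	fmt.Println(root.lookup("makefile")) // 2
}
```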

Thank you!

https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/
https://boyter.org/
https://github.com/boyter/scc-data
https://github.com/boyter/scc
https://news.ycombinator.com/item?id=21121735
https://boyter.org/static/dataenbytes2023/