Processing 40 TB of code from ~10 million projects with a dedicated server and Go for $100

Intro?

I blog at boyter.org
I write free software: github.com/boyter/
I run searchcode.com
On the twitter: @boyter
ActivityPub: @boyter@honk.boyter.org

Largest dataset: ~6PB
Largest table: 2+ trillion rows
Highest QPS: 70,000/s under a DDoS

Outline

What did I try?
What did I learn?
What worked?
The future?
Findings...

Why?

Why would anyone in their right mind do this?

scc

```
$ scc redis
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                          296    180267    20367     31679   128221      32548
C Header                   215     32362     3624      6968    21770       1636
TCL                        143     28959     3130      1784    24045       2340
Shell                       44      1658      222       326     1110        187
Autoconf                    22     10871     1038      1326     8507        953
Lua                         20       525       68        70      387         65
Markdown                    16      2595      683         0     1912          0
Makefile                    11      1363      262       125      976         59
Ruby                        10       795       78        78      639        116
gitignore                   10       162       16         0      146          0
YAML                         6       711       46         8      657          0
HTML                         5      9658     2928        12     6718          0
C++                          4       286       48        14      224         31
License                      4       100       20         0       80          0
Plain Text                   3       185       26         0      159          0
CMake                        2       214       43         3      168          4
CSS                          2       107       16         0       91          0
Python                       2       219       12         6      201         34
Systemd                      2        80        6         0       74          0
BASH                         1       118       14         5       99         31
Batch                        1        28        2         0       26          3
C++ Header                   1         9        1         3        5          0
Extensible Styleshe…         1        10        0         0       10          0
Smarty Template              1        44        1         0       43          5
m4                           1       562      116        53      393          0
───────────────────────────────────────────────────────────────────────────────
Total                      823    271888    32767     42460   196661      38012
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $6,918,301
Estimated Schedule Effort (organic) 28.682292 months
Estimated People Required (organic) 21.428982
───────────────────────────────────────────────────────────────────────────────
Processed 9425137 bytes, 9.425 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
```

Attempt 1

Most common filenames?

makefile    59,141,098
index       33,962,093
readme      22,964,539
jquery      20,015,171
main        12,308,009
package     10,975,828
license     10,441,647
__init__    10,193,245

Attempt 2


How many “pure” projects?

Attempt 3


YAML or YML?

yaml     3,572,609
yml     14,076,349

Attempt 4



Why Go?


Channels and Pipes

```
cat urllist.txt | xargs -P16 python parse.py
```

```
ch := make(chan string)
for i := 0; i < 16; i++ {
	go parse(ch)
}
for _, l := range urllist {
	ch <- l
}
```
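The snippet above is the slide-sized version. A minimal runnable sketch of the same pattern, with parse and urllist stubbed out as placeholders and a WaitGroup plus close added so the program terminates cleanly:

```
package main

import (
	"fmt"
	"sync"
)

// parse stands in for whatever per-URL work is done (cloning, running scc, etc.).
func parse(ch <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for url := range ch {
		fmt.Println("processing", url)
	}
}

func main() {
	urllist := []string{
		"https://github.com/boyter/scc",
		"https://github.com/redis/redis",
	}

	ch := make(chan string)
	var wg sync.WaitGroup

	// Start 16 workers, mirroring xargs -P16.
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go parse(ch, &wg)
	}

	// Feed work down the channel, then close it so the workers exit.
	for _, l := range urllist {
		ch <- l
	}
	close(ch)
	wg.Wait()
}
```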

Files in a repo 95%

Processing...

Database?
Spinning rust...

Go again...

```
filesPerProject := map[int64]int64{}                     // Number of files in each project in buckets, i.e. projects with 10 files or projects with 2
projectsPerLanguage := map[string]int64{}                // Number of projects which use a language
filesPerLanguage := map[string]int64{}                   // Number of files per language
hasLicenceCount := map[string]int64{}                    // Count of if a project has a licence file or not
fileNamesCount := map[string]int64{}                     // Count of filenames
fileNamesNoExtensionCount := map[string]int64{}          // Count of filenames without extensions
fileNamesNoExtensionLowercaseCount := map[string]int64{} // Count of filenames tolower and no extensions
complexityPerLanguage := map[string]int64{}              // Sum of complexity per language
commentsPerLanguage := map[string]int64{}                // Sum of comments per language
sourceCount := map[string]int64{}                        // Count of each source github/bitbucket/gitlab
ymlOrYaml := map[string]int64{}                          // yaml or yml extension?
mostComplex := Largest{}                                 // Holds details of the most complex file
mostComplexPerLanguage := map[string]Largest{}           // Most complex of each file type
mostComplexWeighted := Largest{}                         // Most complex file weighted by lines, NB useless because it only picks up minified files
mostComplexWeightedPerLanguage := map[string]Largest{}   // Most complex of each file type weighted by lines
largest := Largest{}                                     // Holds details of the largest file in bytes
largestPerLanguage := map[string]Largest{}               // Largest file per language
longest := Largest{}                                     // Holds details of the longest file in lines
longestPerLanguage := map[string]Largest{}               // Longest file per language
mostCommented := Largest{}                               // Holds details of the most commented file in lines
mostCommentedPerLanguage := map[string]Largest{}         // Most commented file per language
```
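The Largest type is not shown on the slide; a guess at its minimal shape, based purely on how the variables above use it (tracking a single record-holding file), might be:

```
// Largest is not shown in the talk; this is a guess at its shape based on
// how the variables above are used: remembering the one "winning" file for
// a given metric along with enough detail to report it later.
type Largest struct {
	Name       string // filename or path of the current record holder
	Location   string // which repository it came from
	Lines      int64
	Code       int64
	Comment    int64
	Blank      int64
	Complexity int64
	Bytes      int64
}
```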

The Java FactoryFactory

not factory             271,375,574   97.9%
factory                   5,695,568   2.09%
factoryfactory               25,316   0.009%
factoryfactoryfactory             0   :(

Raw Numbers

9,985,051 total repositories
9,100,083 repositories with at least 1 identified file
884,968 empty repositories (those with no files)
3,529,516,251 files in all repositories
40,736,530,379,778 bytes processed (40 TB)
1,086,723,618,560 lines identified
816,822,273,469 code lines identified
124,382,152,510 blank lines identified
145,519,192,581 comment lines identified
71,884,867,919 complexity count according to scc rules

Potty Mouth

Language     Curse count   %
C Header           7,660   0.001%
Java               7,023   0.002%
C                  6,897   0.001%
PHP                5,713   0.002%
JavaScript         4,306   0.001%
Dart               1,533   0.191%

Lessons Learnt

Don't store lots of files in /tmp
Don't use S3 at first...
Consider compression, such as zstd.
Keep results locally!
When CPU usage is high for a long time, consider a dedicated server
Use --depth=1 for shallow git clones (see the sketch after this list)
Go works well as glue code
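Not the actual pipeline code, but a minimal sketch of those last two points together: driving a shallow git clone from Go as glue code (the repository URL and destination path here are placeholders).

```
package main

import (
	"fmt"
	"os/exec"
)

// shallowClone fetches only the latest commit, which is all a line counter
// needs, and keeps clone times and disk usage down.
func shallowClone(url, dest string) error {
	cmd := exec.Command("git", "clone", "--depth=1", url, dest)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("clone %s failed: %v: %s", url, err, out)
	}
	return nil
}

func main() {
	if err := shallowClone("https://github.com/boyter/scc", "./clones/scc"); err != nil {
		fmt.Println(err)
	}
}
```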

Dedicated server go brrr

Missing a license

The Future?

Keeping the URL properly...
Forget S3
Trie to store filenames (avoid lossy counting; see the sketch after this list)
Use SQLite?
Latest version of scc
Determine maintainability
Tabs vs Spaces
Don't store as JSON
Try bigquery?
Another language? Rust? Zig?
Explorer!
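For the trie idea above, a rough sketch of what storing filename counts in a trie could look like (hypothetical code, not part of the existing pipeline):

```
package main

import "fmt"

// trieNode stores filename counts character by character, so no name is
// thrown away the way a capped map of counts would be.
type trieNode struct {
	children map[rune]*trieNode
	count    int64 // number of files whose name ends exactly here
}

func newTrieNode() *trieNode {
	return &trieNode{children: map[rune]*trieNode{}}
}

// insert walks the filename, creating nodes as needed, and bumps the count
// at the final node.
func (t *trieNode) insert(name string) {
	node := t
	for _, r := range name {
		child, ok := node.children[r]
		if !ok {
			child = newTrieNode()
			node.children[r] = child
		}
		node = child
	}
	node.count++
}

// lookup returns how many times a filename was inserted.
func (t *trieNode) lookup(name string) int64 {
	node := t
	for _, r := range name {
		child, ok := node.children[r]
		if !ok {
			return 0
		}
		node = child
	}
	return node.count
}

func main() {
	root := newTrieNode()
	for _, name := range []string{"makefile", "makefile", "readme"} {
		root.insert(name)
	}
	fmt.Println(root.lookup("makefile")) // 2
}
```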

Thank you!

https://boyter.org/posts/an-informal-survey-of-10-million-github-bitbucket-gitlab-projects/
https://boyter.org/
https://github.com/boyter/scc-data
https://github.com/boyter/scc
https://news.ycombinator.com/item?id=21121735
https://boyter.org/static/dataenbytes2023/