A simple blog post here to introduce a new command line tool licensechecker (lc), which is similar in purpose to the library Licensee http://ben.balter.com/licensee/ which attempts to identify what software license(s) code is released under. lc itself is dual licensed under the MIT and Unlicense.
Licensechecker (lc) was designed to be run either on your command line or using your CI tool of choice and produce either CSV, JSON, SPDX, Summary or CLI tabular or progress output with the option to write results to disk. It is reasonably fast and is cross platform with binaries for x64 versions of Linux, macOS and Windows. The build process also ensures that it builds on ARM and i386.
You can find both the binaries and code on GitHub https://github.com/boyter/lc/
Some time ago I wrote a collection of Python scripts which would when tweaked would scan a file directory and tell you what licenses the files that existed within in were under http://www.boyter.org/2017/05/identify-software-licenses-python-vector-space-search-ngram-keywords/
Licensechecker has taken the ideas I toyed around in Python and produced a much more battle tested command line application written in Go which is ready for general use. I chose to write it in Go simply because at my current role there is a push to use Go and I have a lack of experience with it.
It is reasonably well tested (always room for improvement), and while there are some easy wins that could be done for performance improvements it is reasonably fast.
Some sample outputs of what it can produce follow.
Example output of licencechecker running against itself in tabular format while ignoring the .git, licenses and vendor directories
$ lc -pbl .git,vendor,licenses -f tabular . Directory File License Confidence Size . .gitignore (MIT OR Unlicense) 100.00% 275B . .travis.yml (MIT OR Unlicense) 100.00% 188B . CODE_OF_CONDUCT.md (MIT OR Unlicense) 100.00% 3.1K . CONTRIBUTING.md (MIT OR Unlicense) 100.00% 1.2K . Gopkg.lock (MIT OR Unlicense) 100.00% 1.4K . Gopkg.toml (MIT OR Unlicense) 100.00% 972B . LICENSE Unlicense AND MIT 94.83% 1.1K . README.md (MIT OR Unlicense) 100.00% 7.5K . UNLICENSE MIT AND Unlicense 95.16% 1.2K . database_keywords.json (MIT OR Unlicense) 100.00% 3.6M . main.go (MIT OR Unlicense) 100.00% 3.4K . what-we-look-at.md (MIT OR Unlicense) 100.00% 3.2K examples/identifier LICENSE GPL-3.0+ AND MIT 95.40% 1K examples/identifier LICENSE2 MIT AND GPL-3.0+ 99.65% 35K examples/identifier has_identifier.py (MIT OR GPL-3.0+) AND GPL-2.0 100.00% 409B parsers constants.go (MIT OR Unlicense) 100.00% 4.8M parsers formatter.go (MIT OR Unlicense) 100.00% 7.8K parsers formatter_test.go (MIT OR Unlicense) 100.00% 944B parsers guesser.go (MIT OR Unlicense) 100.00% 9.8K parsers guesser_test.go (MIT OR Unlicense) 100.00% 3.4K parsers helpers.go (MIT OR Unlicense) AND Apache-2.0 100.00% 2.4K parsers helpers_test.go (MIT OR Unlicense) 100.00% 1.5K parsers structs.go (MIT OR Unlicense) 100.00% 679B scripts build_database.py (MIT OR Unlicense) 100.00% 4.6K scripts include.go (MIT OR Unlicense) 100.00% 951B
To write out the results to a CSV file
$ lc --format csv -output licences.csv --pathblacklist .git,licenses,vendor .
Or to a valid SPDX 2.1 file
$ lc -f spdx -o spdx_example.spdx --pbl .git,vendor,licenses -dn licensechecker -pn licensechecker .
You can specify multiple directories as additional arguments and all results will be merged into a single output
$ lc -f tabular ./examples/identifier ./scripts
You can also specify files and directories as additional arguments
$ lc -f tabular README.md LICENSE ./examples/identifier Directory File License Confidence Size README.md NOASSERTION 100.00% 7.5K LICENSE MIT 94.83% 1.1K ./examples/identifier LICENSE GPL-3.0+ AND MIT 95.40% 1K ./examples/identifier LICENSE2 MIT AND GPL-3.0+ 99.65% 35K ./examples/identifier has_identifier.py (MIT OR GPL-3.0+) AND GPL-2.0 100.00% 428B
What follows is a brief overview of how it currently works.
Licencechecker works by comparing files against the list of licenses supplied by SPDX.org with the addition of the Fair Source License which I needed for my own purposes.
First the command line arguments are checked to see if they refer to file or a folder. The difference between the two is that if it is a folder the whole folder is first scanned to see if it can identify any files which indicate a license. This can be controlled using the argument licensefiles which by default is set to look for filenames which contain the string license, copying or readme. For example the following files would be identified as potentially containing a license.
LICENSE LICENCE license.md COPYING.txt LICENSE-MIT COPYRIGHT UNLICENSE README.md
These are then taken as potential root licenses under which all other files would be marked against.
If there are multiple root licenses then they are treated using OR as through the project is dual licensed. The check for a root license happens in every folder.
When a candidate for a license is found its contents are checked against a list of unique ngrams for licenses. If there any ngrams matching then each license is checked using a vector space to obtain the fuzzy match value for that license. If the value is over the confidence value set which by default is 85% then it is marked as being a match.
If no license is found using the previous step then the file is checked against all licenses using the fuzzy matching. This is because some licenses do not have any unique ngrams allowing identification. Fuzzy matching only looks at the top portion of the file where license headers are expected to exist.
At the end the matches if any are sorted by and the most likely match is returned as the matching license. As such a license file is considered to only contain a single declared license.
Note that only the most recent root licenses are taken into account, so if a project with a root license of MIT has a sub folder with a root license of GPL-2.0+ files in that root folder will be marked as being GPL-2.0+ and not MIT.
For individual files the file is scanned in the same way but in addition is scanned for any SPDX indicators such as,
Which will take precedence over any fuzzy matching. The indicators must match known SPDX licenses or they will be disregarded.
When finished the license determined is based on the SDPX identifiers if present, fuzzy matching if over the confidence value and then the root license(s). If multiples match they are treated as an AND.
Take for example,
Directory File License Confidence Size ./examples/identifier has_identifier.py (MIT OR GPL-3.0+) AND GPL-2.0 100.00% 428B
The root licenses were identified as being both MIT and GPL-3.0+ however inside the code itself it has a GPL-2.0 identifier. As such the license of the file is either MIT AND GPL-2.0 OR GPL-3.0+ OR GPL-2.0. The indicator for the license like this is based on SPDX examples and as specified in the SPDX specification https://spdx.org/spdx-specification-21-web-version
Currently licencechecker (lc) does not indicate if a project may be in violation of license requirements. Assuming there is some sort of license compatibility chart in the future this would be something to add at a later point.
There are a few known issues with the current version. License’s are fuzzy matched so its possible someone could fork an existing license and have it be a false positive match. The second issue is that license matches are based on licenses from SPDX and as such may miss some licenses.
If you have read this far thanks for your time. Id love you to try licencechecker and report back any bugs. I would also love to hear how you are using it and with permission put a link in the README with your testimonial and link. If you want to contribute back use GitHub issues. As with anything I put online I am happy to get constructive feedback.