Today I was wondering which license is most commonly used in OpenAPI definitions, so I went and did a quick analysis.
The top 5 (with count and percentage; n=552):
| “Open Government License – British Columbia” | 6 | 1.09% |
The struck-out entries are the ones that I would not really consider a proper license.
The license names inside quotation marks are the exact copy-paste from the field. The rest are de-duplicated into their SPDX identifiers.
After those top 5, the long tail very quickly drops to a single listed API per license. Several of those look very odd as well.
Note: Before you start complaining, I realise this is probably a very sub-optimal solution code-wise, but it worked for me. In my defence, I did open up my copy of the Sed & Awk Pocket Reference before my eyes went all glassy and I hacked up the following ugly method. Also note that the shell scripts are in Fish shell and may not work directly in a 100% POSIX shell.
First, I needed to get a data set to work on. Hat-tip to Mike Ralphson for pointing me to APIs Guru as a good resource. I analysed their APIs-guru/openapi-directory repository[2], where they keep a big collection of public APIs in the APIs folder, most of them following the OpenAPI (previously Swagger) specification.
git clone https://github.com/APIs-guru/openapi-directory.git
cd openapi-directory/APIs
Next I needed to list all the licenses found there. For this I assumed the name: tag in YAML[4] (the one including the name of the license) to be on the very next line after the license: tag[3] – I relied on people writing OpenAPI files in the same order as it is laid out in the OpenAPI Specification. I stored the list of all licenses, sorted alphabetically, in a separate file, api_licenses.
grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 --no-filename | \
    grep 'name:' | sort > api_licenses
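Since the recursive `**` glob above is Fish syntax, here is a rough sketch of how the same listing might look in a plain POSIX shell, using `find` instead. The helper name `list_license_names` and the tiny fixture are made up for illustration, and stripping the `name:` prefix is an extra step the one-liner above does not do. It still relies on the same assumption that `name:` directly follows `license:`.

```shell
#!/bin/sh
# Sketch: POSIX sh take on the license-name listing, using find instead of '**'.
# list_license_names is a hypothetical helper name, not from the original post.
list_license_names() {
    # $1: directory to search (e.g. openapi-directory/APIs)
    find "$1" -type f \( -name openapi.yaml -o -name swagger.yaml \) \
        -exec grep -h -A 1 'license:' {} + |
        grep 'name:' |
        sed 's/^[[:space:]]*name:[[:space:]]*//' |   # extra step: drop the "name:" key
        sort
}

# Tiny made-up fixture so the sketch can be run without cloning anything.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/a/1.0" "$demo_dir/b/2.0"
printf 'info:\n  license:\n    name: Apache 2.0\n    url: http://www.apache.org/licenses/LICENSE-2.0.html\n' \
    > "$demo_dir/a/1.0/openapi.yaml"
printf 'info:\n  license:\n    name: MIT\n' > "$demo_dir/b/2.0/swagger.yaml"

list_license_names "$demo_dir"
```

The `-A` option is not part of POSIX `grep`, but both GNU and BSD `grep` support it.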
Then I generated another file called api_licenses_unique that would include each of these license names only once.
grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 --no-filename | \
    grep 'name:' | sort | uniq > api_licenses_unique
Because I was too lazy to figure out how to do this properly[5], I simply wrapped the same one-liner into a script that goes through all the unique license names and counts how many times each shows up in the full (not de-duplicated) list of all licenses found.
for license in (grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 \
        --no-filename | grep 'name' | sort | uniq)
    grep "$license" api_licenses --count
end
In the end I copied the console output of this last command, opened api_licenses_unique, and pasted said output in the first column (by going into Block Selection Mode in Kate).
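One thing to watch out for in the counting loop above: `grep` treats the license name as a regular expression and matches it anywhere in a line, so a short name is also counted on lines that carry a longer name starting with the same text. A minimal sketch of the difference, in POSIX shell and on a made-up list of names, using the `-F` (fixed string) and `-x` (whole line) options:

```shell
#!/bin/sh
# Sketch: substring vs. whole-line counting, on a made-up list of names.
tmp=$(mktemp -d)
cat > "$tmp/api_licenses" <<'EOF'
name: Apache 2.0
name: Creative Commons
name: Creative Commons Attribution 3.0
name: Creative Commons Attribution 3.0
EOF

# Substring match: also counts the two "Attribution 3.0" lines.
loose=$(grep -c 'name: Creative Commons' "$tmp/api_licenses")

# -F: treat the pattern as a fixed string; -x: match whole lines only.
exact=$(grep -c -F -x 'name: Creative Commons' "$tmp/api_licenses")

echo "loose=$loose exact=$exact"   # prints: loose=3 exact=1
```

Whether whole-line matching is what you want depends on how the names are stored; with leading indentation kept in the file, the pattern has to include it as well.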
Clarification on what I consider “proper license” and re-count of Creative Commons licenses (12 July 2019 update)
I was asked what I considered as a “proper license” above, and specifically why I did not consider “Creative Commons” as such.
First, if the string did not even remotely look like a name of a license, I did not consider that as a proper license. This is the case e.g. with “This page was built with the Swagger API.”.
As for the string “Creative Commons”, it – at best – indicates a family of licenses, which span a vast spectrum from CC0-1.0 (basically public domain) on one end to CC-BY-NC-ND-4.0 (basically, you may copy this, but not change anything, nor make money out of it, and you must keep the same license) on the other. For reference, on the SPDX license list, you will find 32 Creative Commons licenses. And SPDX lists only the International and Universal versions of them[7].
Admittedly – and this is a caveat in my initial method above – it may be that there is an actual license in the lines following the “Creative Commons” string … or, as it turned out to be true, that the initial count of 255 name: Creative Commons licenses also included valid CC license names such as name: Creative Commons Attribution 3.0.
So, obviously I made a boo-boo, and therefore went and dug deeper ;)
To do so, and after looking at the results a bit more, I noticed that the url: entries of the name: Creative Commons licenses seemed to point to actual CC licenses, so I decided to rely on that. Luckily, this turned out to be true.
I broadened the initial search by one extra line to include the url: line, narrowed the next search down to name: Creative Commons, and in the end only to the url: line, storing the result in api_licenses_cc:
grep 'license:' **/openapi.yaml **/swagger.yaml -A 2 --no-filename | \
    grep 'name: Creative Commons' -A 1 | grep 'url' | sort > api_licenses_cc
Next, I searched for the most common license – CC-BY-3.0:
grep --count 'creativecommons.org/licenses/by/3.0' api_licenses_cc
The result was 250, so for the remaining[6] 5 I just opened the api_licenses_cc file and counted them manually.
Using this method, the list of all “Creative Commons” licenses turned out to be as follows:
- CC-BY-3.0 (250, of which one was specific to Australian jurisdiction)
- CC-BY-4.0 (3)
- CC-BY-NC-4.0 (1)
- CC-BY-NC-ND-2.0 (1)
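The manual count above could probably also be automated by grouping the URL lines. Below is a minimal POSIX shell sketch; the helper name `tally_urls` is hypothetical, and the sample lines are made up to resemble the contents of api_licenses_cc rather than taken from the real data.

```shell
#!/bin/sh
# Sketch: tally distinct license URLs. The sample lines are made up.
tmp=$(mktemp -d)
cat > "$tmp/api_licenses_cc" <<'EOF'
    url: http://creativecommons.org/licenses/by/3.0/
    url: http://creativecommons.org/licenses/by/4.0/
    url: http://creativecommons.org/licenses/by/3.0/
    url: http://creativecommons.org/licenses/by-nc/4.0/
    url: http://creativecommons.org/licenses/by/3.0/
EOF

# tally_urls is a hypothetical helper name.
tally_urls() {
    # Strip the leading "url:" key, then group identical URLs and count them,
    # most frequent first.
    sed 's/^[[:space:]]*url:[[:space:]]*//' "$1" |
        sort | uniq -c | sort -rn
}

tally_urls "$tmp/api_licenses_cc"
```

URLs that differ only in scheme, trailing slash, or jurisdiction suffix would still show up as separate rows, so some normalising or a final look by eye remains necessary.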
In this light, I am amending the results above, and removing the bogus “Creative Commons” entry. Apart from removing the bogus entry, it does not change the ranking, nor the counts, of the top 5 licenses.
Further clean-up of Apache (17 July 2019 update)
Upon further inspection it looked odd that I was getting so many Apache-2.0 matches – if you added all the Apache-2.0 hits (initially 421) to all the CC-BY-3.0 hits (250), you already reached a higher number than all the occurrences of the license: field in all the files (552). Clearly something was off.
So I re-counted the Apache hits by limiting myself only to the url: field of the license:, instead of the name:, and came to half of the original number, which brought it from first down to second place. Basically, I applied the same method as above for counting Creative Commons licenses.
Better method (25 July 2019 update)
I just learnt of a better solution from Jaka “Lynx” Kranjc. Basically, I could cut down quite a bit by simply using uniq --count, which collapses repeated adjacent lines into a unique list and prepends a column with the number of times each one occurred – super useful!
I will not edit my findings above again, but am mentioning the better method below, together with the attached results, so others can simply check.
# uniq only merges adjacent duplicates, so the input is sorted first
grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 --no-filename | \
    grep 'name:' | sort | uniq -c | sort > OpenAPI_grouped_by_license_name.txt

grep 'license:' **/openapi.yaml **/swagger.yaml -A 2 --no-filename | \
    grep 'url:' | sort | uniq -c | sort > OpenAPI_grouped_by_license_url.txt
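To get from such grouped counts to a ranked list with percentages, something along these lines might do. It is only a sketch in POSIX shell and awk: `rank_with_share` is a hypothetical helper, the input names are made up, and the percentage is relative to the number of lines fed in, which is not necessarily the same as the number of license: fields.

```shell
#!/bin/sh
# Sketch: rank names by frequency and add a percentage column.
# rank_with_share is a hypothetical helper; it reads one name per line on stdin.
rank_with_share() {
    sort | uniq -c | sort -rn |
        awk '{
                 count[NR] = $1          # first field is the count from uniq -c
                 $1 = ""                 # drop it; awk rebuilds $0 with single spaces
                 name[NR] = substr($0, 2)
                 total += count[NR]
             }
             END {
                 for (i = 1; i <= NR; i++)
                     printf "%d\t%.2f%%\t%s\n", count[i], 100 * count[i] / total, name[i]
             }'
}

# Made-up input, just to show the shape of the output.
printf '%s\n' 'MIT' 'Apache 2.0' 'MIT' 'MIT' | rank_with_share
```

Piping the result through `head -n 5` would give a top 5. Note that rebuilding the line in awk squeezes runs of spaces inside a name into single spaces.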
hook out → not proud of the method, but happy with having results
[3] I also tried it with 3 lines, and the few extra results that came up were mostly useless.
[4] I did a quick check and the repository seems to include no OpenAPIs in JSON format.
[5] I expected “for license in api_licenses_unique” to work, but it did not.
[6] The result of wc -l api_licenses_cc was 255.
[7] Prior to version 4.0 of Creative Commons licenses, each CC license had several versions localised for specific jurisdictions.