Today I was wondering what the most commonly used license that people use in OpenAPI, so I went and did a quick analysis.

Results

The top 5 (with count and percentage; n=552):

License name count percentage
CC-BY-3.0 250 45,29%
Apache-2.01 218 39,49%
MIT 15 2,71%
“This page was built with the Swagger API.” 8 1,44%
“Open Government License – British Columbia” 6 1,09%

The striked-out entries are the ones that I would not really consider a proper license.

The license names inside quotation marks are the exact copy-paste from the field. The rest are de-duplicated into their SPDX identifiers.

After those top 5 the long end goes very quickly into only one license per listed API. Several of those seem very odd as well.

Methodology

Note

Before you start complaining, I realise this is probably a very sub-optimal solution code-wise, but it worked for me. In my defence, I did open up my copy of the Sed & Awk Pocket Reference before my eyes went all glassy and I hacked up the following ugly method. Also note that the shell scripts are in Fish shell and may not work directly in a 100% POSIX shell.

Also do see below for a revised method.

First, I needed to get a data set to work on. Hat-tip to Mike Ralphson for pointing me to APIs Guru as a good resource. I analysed their APIs-guru/openapi-directory repository2, where in the APIs folder they keep a big collection of public APIs. Most of them following the OpenAPI (previously Swagger) specification.

git clone https://github.com/APIs-guru/openapi-directory.git
cd openapi-directory/APIs

Next I needed to list all the licenses found there. For this I assumed the name: tag in YAML4 (the one including the name of the license) to be in the very next line after the license: tag3 – I relied on people writing OpenAPI files in the same order as it is laid out in the OpenAPI Specification. I stored the list of all licenses, sorted alphabetically in a separate api_licenses file:

grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 --no-filename | \
grep 'name:' | sort > api_licenses

Then I generated another file called api_licenses_unique that would include only all names of these licenses.

grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 --no-filename | \
grep 'name:' | sort | uniq > api_licenses_unique

Because I was too lazy to figure out how to do this properly5, I simply wrapped the same one-liner into a script to go through all the unique license names and count how many times they show up in the (non-duplicated) list of all licenses found.

for license in (grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 \
--no-filename | grep 'name' | sort | uniq)
                           grep "$license" api_licenses --count
                       end

In the end I copied the console output of this last command, opened api_licenses_unique, and pasted said output in the first column (by going into Block Selection Mode in Kate).

Clarification on what I consider “proper license” and re-count of Creative Commons licenses (12 July 2019 update)

I was asked what I considered as a “proper license” above, and specifically why I did not consider “Creative Commons” as such.

First, if the string did not even remotely look like a name of a license, I did not consider that as a proper license. This is the case e.g. with “This page was built with the Swagger API.”.

As for the string “Creative Commons”, it – at best – indicates a family o licenses, which span a vast spectrum from CC0-1.0 (basically public domain) on one end to CC-BY-NC-CA-4.0 (basically, you may copy this, but not change anything, nor get money out of it, and you must keep the same license) on the other. For reference, on the SPDX license list, you will find 32 Creative Commons licenses. And SPDX lists only the International and Universal versions of them7.

Admiteldy, – and this is a caveat in my initial method above – it may be that there is an actual license following the lines after the “Creative Commons” string … or, as it turned out to be true, that the initial 255 count of name: Creative Commons licenses included also valid CC license names such as name: Creative Commons Attribution 3.0.

So, obviously I made a boo-boo, and therefore went and dug deeper ;)

To do so, and after looking at the results a bit more, I noticed that the url: entries of the name: Creative Commons licenses seem to point to actual CC licenses, so I decided to rely on that. Luckily, this turned out to be true.

I broadened up the initial search to one extra line, to include the url: line, narrowed down the next search to name: Creative Commons, and in the end only to url:

grep 'license:' **/openapi.yaml **/swagger.yaml -A 2 --no-filename | \
grep 'name: Creative Commons' -A 1 | grep 'url' | sort > api_licenses_cc

Next, I searched for the most common license – CC-BY-3.0:

grep --count 'creativecommons.org/licenses/by/3.0' api_licenses_cc

The result was 250, so for the remaining6 5 I just opened the api_licenses_cc file and counted them manually.

Using this method the list of all “Creative Commons” license turned out to be as follows:

  1. CC-BY-3.0 (250, of which one was specific to Australian jurisdiction)
  2. CC-BY-4.0 (3)
  3. CC-BY-NC-4.0 (1)
  4. CC-BY-NC-ND-2.0 (1)

In this light, I am amending the results above, and removing the bogus “Creative Commons” entry. Apart from removing the bogus entry, it does not change the ranking, nor the counts, of the top 5 licenses.

Further clean-up of Apache (17 July 2019 update)

Upon further inspection it looked odd that I was getting so many Apache-2.0 matches – if you added all the Apache-2.0 hits (initially 421) with all the CC-BY-3.0 hits (250), you already reached a higher number than all the occurrances of the license: field in all the files (552). Clearly something was off.

So I re-counted the Apache hits by limiting myself only to the url: field of the license:, instead of the name: and came to a half of the original number. Which brought it from first down to second place. Basically I applied the same method as above for counting Creative Commons licenses.

Better method (25 July 2019 update)

I just learnt from Jaka “Lynx” Kranjc of a better solution. Basically, I could cut down quite a bit by simply using uniq --count, which produces a unique list and prepends a column of how many times it found that occurance – super useful!

I will not edit my findings above again, but am mentioning the better method below, together with the attached results, so others can simply check.

grep 'license:' **/openapi.yaml **/swagger.yaml -A 1 --no-filename | \
grep 'name:' |  uniq -c | sort > OpenAPI_grouped_by_license_name.txt

… produces OpenAPI_grouped_by_license_name.txt

grep 'license:' **/openapi.yaml **/swagger.yaml -A 2 --no-filename | \
grep 'url:' |  uniq -c | sort > OpenAPI_grouped_by_license_url.txt

… produces OpenAPI_grouped_by_license_url.txt

hook out → not proud of the method, but happy with having results


  1. This should come as no surprise, as Apache-2.0 is used as the official specification’s example

  2. At the time of this writing, that was commit 506133b

  3. I tried it also with 3 lines, and the few extra results that came up where mostly useless. 

  4. I did a quick check and the repository seems to include no OpenAPIs in JSON format. 

  5. I expected for license in api_licenses_unique to work, but it did not. 

  6. The result of wc -l api_licenses_cc was 255. 

  7. Prior to version 4.0 of Creative Commons licenses each CC license had several versions localised for specific jurisdictions. 


Related Posts


Published

Last Updated

Category

Ius

Tags