The RAIL licenses are gaining ground, but permissive licensing is still the predominant norm governing the sharing of ML models on huggingface.co. This analysis aims to understand how licenses are used by developers who make ML model-related code and/or data publicly available.
In the world of open, one of the most impactful events of 2022 has been the release of a number of powerful machine learning models under open licenses.1 The release of the BLOOM LLM and the Stable Diffusion image generation model under permissive licenses has led to significant downstream use. While many observers had long assumed that large-scale machine learning models could only be afforded by large organizations with access to vast amounts of private data and sufficient economic resources to afford the computational power required to build them, these developments have shown that this assumption was wrong. Instead, there has been a veritable explosion of creativity based on models released under open licenses.
At the center of this development has been a new class of open (or, as some observers might say, ‘open(ish)’2) licenses: the Open RAIL family of licenses.3 These licenses combine elements of permissive open source software licenses with provisions that restrict certain uses of the licensed resources. These restrictions are largely based on ethical considerations and are intended to ensure that the licensed models are used responsibly, hence the name Responsible AI licenses (RAIL).4
The emergence and use of these licenses have attracted considerable attention from many observers,5 much of which has focused on how these licenses work and to what extent they can actually be effective in ensuring the responsible use of publicly available AI models.6
In this paper, we develop a quantitative understanding of the use of open licenses for sharing ML models and, in particular, of the use of the Open RAIL licenses. We have done this by conducting a historical analysis of publicly available license information on huggingface.co, which is currently one of the most widely used repositories for openly licensed ML models.
With this analysis, we seek to understand how licenses are used by developers who make ML model-related code and/or data publicly available. Our initial scan showed that projects in this space use a large variety of licenses that vary in type7 and licensing conditions. Regarding licensing conditions, our main interest was to understand whether there are clear patterns in license choice when it comes to publicly sharing ML models, and to understand the impact of the emergence of the Open RAIL licenses. Our interest in the use of the Open RAIL licenses is also driven by the fact that their use could be understood as a type of self-regulation based on community norms in a field that is so far largely dominated by permissive open source software licenses8.
With regard to license types, we were interested in understanding whether there are any clear patterns explaining the choice of either software licenses or content/data licenses. However, after our initial data gathering, we abandoned our efforts to understand patterns related to the type of license. Based on the data that we obtained from huggingface.co, we did not observe any noteworthy patterns, and the data did not allow us to make inferences about the motivation of repository owners for choosing a specific type of license. It remains an interesting observation that repositories relating to AI models are licensed under a mix of (open source) software licenses and open content/data licenses. This likely reflects the reality that repositories in the model category on huggingface.co are used to store both training data (for which open content/data licenses are most appropriate) and models (for which software licenses are more appropriate).
As a result, we focused our analysis on understanding the use of different licensing conditions and how it has evolved since huggingface.co started to make snapshots of repository metadata available in September 2022.
For this paper, we analyzed data from four of these snapshots:
For each of these snapshots, we have extracted information about the licenses in use and the number of downloads in the past 30 days for all of the repositories published in the models category on huggingface.co. We have then sorted the individual licenses into the following six categories and grouped the repositories according to these categories:
Permissive / BY (permissive open source software licenses and attribution-style open content and open data licenses);
Copyleft / SA (Copyleft software licenses and Share Alike style open content and open data licenses);
Public Domain (Public Domain dedications and licenses that come with no restrictions or conditions);
Open RAIL (Licenses from the Open RAIL family of licenses, including the original RAIL license);
Non-commercial / NC use only (software, content and data licenses that do not allow commercial use);
Other restrictions (licenses that come with other restrictions, for example, prohibitions on creating derivatives).
A more detailed explanation of the categories, including an overview of individual licenses per category, can be found in the GitHub repository containing the collected data.9
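The grouping step described above could be sketched as follows. This is a minimal illustration, not the authors' actual mapping: the license identifiers are common license tags on huggingface.co, but their assignment to categories here is an assumption made for the example, and the names `CATEGORY_BY_LICENSE`, `categorize` and `count_by_category` are hypothetical. The authoritative mapping is the one in the GitHub repository.

```python
from collections import Counter

# Illustrative mapping only (an assumption for this sketch); the full,
# authoritative mapping is documented in the authors' GitHub repository.
CATEGORY_BY_LICENSE = {
    "apache-2.0": "Permissive / BY",
    "mit": "Permissive / BY",
    "cc-by-4.0": "Permissive / BY",
    "gpl-3.0": "Copyleft / SA",
    "cc-by-sa-4.0": "Copyleft / SA",
    "cc0-1.0": "Public Domain",
    "unlicense": "Public Domain",
    "openrail": "Open RAIL",
    "creativeml-openrail-m": "Open RAIL",
    "bigscience-bloom-rail-1.0": "Open RAIL",
    "cc-by-nc-4.0": "Non-commercial / NC",
    "cc-by-nd-4.0": "Other restrictions",
}


def categorize(license_id):
    """Return the category for a license tag, or 'Uncategorized' if unknown."""
    return CATEGORY_BY_LICENSE.get(license_id.strip().lower(), "Uncategorized")


def count_by_category(repos):
    """Count repositories per category; each repo is a dict with a 'license' key."""
    return Counter(categorize(repo["license"]) for repo in repos)
```

Unknown tags fall into a separate "Uncategorized" bucket rather than being silently dropped, which makes gaps in the mapping visible when the counts are reviewed.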
This resulted in a dataset containing between 24,420 (27-09-2022) and 39,115 (24-01-2023) repositories with licensing information attached. The majority of these repositories have had no or very few downloads. Since we are mainly interested in understanding the use of licenses for ML models that are actively used by other developers, we have then created a subset of repositories that exceed a minimum download threshold.
To do this, we have selected those repositories that have had at least 30 cumulative downloads during the four 30-day periods preceding the snapshots. This has resulted in a second dataset containing between 7,343 (27-09-2022) and 11,308 (24-01-2023) repositories. A full description of the method used for selecting and categorizing the data can be found in our GitHub repository.10
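One possible reading of this selection rule can be sketched as follows. The sketch assumes that each snapshot is represented as a dictionary mapping a repository identifier to its 30-day download count, that downloads are summed across all four snapshots, and that a repository belongs to a snapshot's subset if it appears in that snapshot and its summed downloads reach the threshold. The function names and data layout are hypothetical; the method actually used is described in the GitHub repository.

```python
def cumulative_downloads(snapshots):
    """Sum the 30-day download counts of each repository across all snapshots."""
    totals = {}
    for snapshot in snapshots:
        for repo_id, downloads in snapshot.items():
            totals[repo_id] = totals.get(repo_id, 0) + downloads
    return totals


def active_subsets(snapshots, min_downloads=30):
    """For each snapshot, keep the repositories whose cumulative downloads
    across all snapshots reach the threshold (assumed reading of the rule)."""
    totals = cumulative_downloads(snapshots)
    return [
        {repo_id for repo_id in snapshot if totals[repo_id] >= min_downloads}
        for snapshot in snapshots
    ]
```

Under this reading, the size of the subset can differ between snapshots simply because repositories created later are absent from the earlier snapshots.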
In both datasets, we see a clear trend regarding the use of the Open RAIL licenses. The share of repositories made available under an Open RAIL license, as a percentage of all repositories with licensing information, has increased from 0.54% on 27-09-2022 to 9.81% on 24-01-2023 (see figure 1). Similarly, the share of repositories made available under an Open RAIL license, as a percentage of those repositories with at least 30 cumulative downloads, has increased from 0.43% on 27-09-2022 to 7.1% on 24-01-2023 (see figure 2).
In both datasets, the Open RAIL licenses are now the second most used license category. They are, however, a distant second to permissive open source software licenses that account for the vast majority of repositories in both datasets: 82.5% of repositories with at least 30 cumulative downloads and 80.2% of all repositories with licensing information.
The dominance is even starker when looking at the total number of downloads. Here permissive open source licenses account for 89.4% of all downloads of repositories with at least 30 cumulative downloads. The share of downloads for repositories licensed under Open RAIL licenses amounts to only 3.96% (see figure 3).
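The two kinds of share reported above (share of repositories and share of downloads per license category) can be computed along the following lines. This is a minimal sketch on a hypothetical data layout in which each repository is a dictionary with a `category` and a `downloads` field; it is not the authors' code.

```python
from collections import defaultdict


def category_shares(repos):
    """Return, per category, the percentage of repositories and of downloads."""
    repo_counts = defaultdict(int)
    download_counts = defaultdict(int)
    for repo in repos:
        repo_counts[repo["category"]] += 1
        download_counts[repo["category"]] += repo["downloads"]

    total_repos = len(repos)
    total_downloads = sum(download_counts.values())

    shares = {}
    for category, count in repo_counts.items():
        shares[category] = {
            "repo_share": 100 * count / total_repos,
            # Guard against a dataset in which no repository has any downloads.
            "download_share": (
                100 * download_counts[category] / total_downloads
                if total_downloads
                else 0.0
            ),
        }
    return shares
```

Computing both shares side by side makes it easy to see when a category's share of downloads differs markedly from its share of repositories, as is the case for the Open RAIL licenses in the figures above.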
Based on the data we have collected, it is clear that the Open RAIL licenses have rapidly established themselves as a new type of license playing a vital role in the Open Source ML development field. In less than half a year, their use has overtaken the use of most established licensing categories that have been designed to safeguard community norms. They have surpassed both copyleft licenses (that seek to ensure that the benefits from access to licensed resources accrue only to those who contribute to the further development of the resources) and licenses that do not authorize commercial uses of the licensed resources (intended to prevent commercial exploitation of the licensed resources while encouraging non-commercial uses).
This could be indicative of the emergence of a new set of community norms among ML developers and researchers that deal with perceived societal dangers of the use of AI/ML models. While this is the stated rationale of the creators of the Open RAIL licenses, the data we have collected does not allow us to draw these conclusions yet. There are at least two possible explanations for the relatively rapid rise in the use of Open RAIL licenses in the field:
(1) The observed increased use of Open RAIL licenses could indeed be reflective of a developing set of community norms that results in developers deliberately choosing to release models under these licenses because the licenses reflect their norms; or
(2) The observed increased use of Open RAIL licenses could be due to the use of these licenses for a small number of highly visible foundational models that have generated a lot of downstream uses. In this case, the increased use of these licenses would reflect the overall importance of these foundational models for the field, since the licensing decisions made for popular models carry over to the projects that are derived from them.
There are indications for the latter hypothesis as a lot of the repositories licensed under Open RAIL licenses seem to be related to models released by Stability.ai. To better understand the reason for the observed increased use of Open RAIL licenses, it will therefore be necessary to analyze the relationship and dependencies between repositories making use of these licenses. Such an analysis has been out of the scope of the analysis that we have conducted so far.
At the current stage, it is also too early to assess if the concerns encoded in the Open RAIL licenses will have a lasting impact on open source ML development.11 For now, the field is dominated by permissive open source licenses that are designed to reduce friction by minimizing restrictions and conditions for uses. This points to a set of community norms that prioritize technological progress and rapid iteration over responsible use of the technology.
Overall, this means that the underlying idea that has led to the development of the Open RAIL licenses, namely to promote a responsible approach to AI development by embedding safeguards into licensing practices, has had a limited impact so far. With the field being dominated by permissive open source licenses, calls for regulation aimed at ensuring the responsible development of ML technologies will likely be met in the form of external regulation, such as the European Union's proposal for an AI Act12. An interesting consequence of this situation might be that broad regulatory efforts such as the AI Act might impose conditions on the development of ML systems that will be hard to comply with for developers of openly licensed ML models.13 In such a situation, the Open RAIL licenses might end up being less of a tool for self-regulation based on community norms and more of a tool for regulatory compliance.