
Notes on BLOOM, RAIL and openness of AI

These notes offer an initial analysis of a growing movement that strives to make key machine learning tools and technologies open and shared.

Published on Sep 09, 2022

These notes offer an initial analysis of a growing movement that strives to make key machine learning tools and technologies open and shared. “Open Artificial Intelligence”, a new field of open, is emerging and orienting itself against previous approaches to open source programming, open access science and open data.

This analysis is presented as a request for comments. We will be grateful for feedback on our insights regarding how open approaches to Artificial Intelligence signal the broader space of open licensing, and the Open Movement.

BLOOM is a large language model that was released in July under the RAIL license – a new copyright license that combines an Open Access approach to licensing with behavioral restrictions aimed at enforcing a vision of responsible AI use. Both the model and the license are the result of the work of BigScience, a network of over 1000 AI researchers facilitated by HuggingFace, the company that aims to “democratize good machine learning”.

The release is an important turning point in terms of how AI technologies are managed and shared, and a symbolic starting date for an emergent new field of openness: Open AI. In May this year, Meta released its own large language model, OPT, under a similarly designed license. And in August, the Stable Diffusion text-to-image model was released under CreativeML Open RAIL-M, a modified version of the RAIL license. As a backdrop to these developments, the incumbent large language and image generation models, like GPT, Midjourney or DALL·E, have been kept proprietary and made available only under permissioned access.

The decisions made regarding the governance of the model, and design of the new licenses, show a new approach to sharing that will have an impact beyond the field of machine learning research and development. It will affect other “fields of open” – distinct spheres of activity, to which open sharing frameworks have been applied, and advocated for. 

The new licenses, and associated sharing frameworks, strive not just to secure openness of resources, but also ethical uses and responsibility for their impact. As such, they offer a new perspective on the Paradox of Open: dealing with the contradictions, and managing power imbalances caused by open sharing. And they signal an urgent need to revisit open licensing frameworks.

A holistic approach to “open AI” also requires a consideration of other parts of the AI technological stack, and ways of sharing them: for example, sharing of training datasets (which we are investigating with our AI_Commons initiative) or openness of algorithms – they are outside the scope of this analysis.

Large Open-science Open-Access Language Model

BLOOM is a new large language model (also called a foundation model), an algorithm that is a crucial part of the AI technological stack. Such models can generate (or more precisely, continue) text – in the case of BLOOM, in 46 natural and 13 programming languages. BLOOM is meant as an open and collaboratively developed alternative to other models that are either closed (like GPT-3), or open but proprietary (like OPT). 

The acronym stands for BigScience Large Open-science Open-access Multilingual Language Model. The name makes it immediately clear that its creators believe in doing open science, in open access to scientific tools and knowledge, and in open source infrastructures. The stakes for open-sourcing such algorithms are high. The generalized character of large language models means that they can be applied to a broad range of generative tasks. This in turn determines their power, but also means that they can potentially be applied in contexts removed from those intended by developers – raising the stakes for permissive licensing approaches that follow the philosophy of “permission given in advance”.

Generative tools – foundation models, but also an operating system, a browser, a content publishing framework or an encyclopedia – are prime candidates for being openly shared: they have the potential of being widely used, often in innovative ways. But in the case of machine learning, the generative character of the technologies has also raised concerns over ways to ensure responsible, or ethical, uses.

The decision to make access to the BLOOM model code open should be seen in the context of how other models are made available. Most famously, GPT-2, an early language model released by OpenAI, was initially not shared openly, due to concerns about the ability of these models to “generate deceptive, biased, or abusive language at scale”.

The company opted for a staged release, with the full model made available only after about nine months (during which there were few signs of harmful use). Nevertheless, the next version of the model, GPT-3, was made available only through a permissioned API (and in parallel, preferential use of the technology was exclusively licensed to Microsoft).

In contrast to this, the research community behind BLOOM has decided to release the model publicly, even though they acknowledge challenges similar to those faced by the creators of GPT-2 and GPT-3. The release notes for the model mention a broad range of challenges related to AI fairness, transparency, explainability and robustness, as well as the impact on privacy, accountability, addiction, manipulation, and misuse.

Creators of previous models have seen these challenges as a reason to either keep a model closed (as is the case with most image generation models), or provide permissioned access through an API. Admittedly, their decisions – framed in terms of ethical concerns – are at the same time an outcome of business considerations. In turn, creators of BLOOM have opted for an Open Access approach, firmly believing in norms of open sharing. Treating this norm as a foundation, they then searched for ways of enforcing responsible uses of AI technologies. And in order to achieve this, they decided to introduce behavior restrictions:

“We feel that there is a balance to be struck between maximizing access and use of LLMs on the one hand, and mitigating the risks associated with use of these powerful models, on the other hand, which could bring about harm and a negative impact on society. […] Whereas the principles of ‘openness’ and ‘responsible use’ may lead to friction, they are not mutually exclusive, and we strive for a balanced approach to their interaction”.

While this analysis focuses on the BLOOM model, two other models were recently shared under permissive licenses. In May, Meta AI publicly shared OPT, its own large language model. The OPT-175B model is released under a bespoke license that limits use to non-commercial research. It includes additional use restrictions covering biometric processing, nuclear technologies, and any military or surveillance purposes. The company itself has not framed the release in the language of open source or open access: “By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, before broader commercial deployment occurs.” And in August, a large text-to-image model, Stable Diffusion, was released under a license that is a derivative of the RAIL license.

Open licenses with behavioral restrictions

The creators of the RAIL (Responsible AI License) license state that they “opted to design an open and permissive license that also includes use-based restrictions”. The starting point is the Apache 2.0 license, one of the popular open-source licenses. On top of it, a broad range of behavioral (or use-based) restrictions is introduced. These aim to enforce a range of norms of responsible, or ethical, AI use.

The license white paper frames this as a response to the inability of other mechanisms, like ethical guidelines, to enforce these norms: “We argue that licenses could serve as a useful strategy for creating, expanding, and enforcing responsible behavioral norms given the limitations of self-regulation and governmental legislation. […] we advocate for the use of licensing as a mechanism for enabling legally enforceable responsible use”. 

The debate about the effectiveness and efficiency of licensing and regulation, as two different ways of achieving public interest goals, is well known in the Open Movement. Already in 2013, Creative Commons published a position on copyright reform that acknowledged the two approaches as complementary. This issue will remain relevant as the debate on the European AI Act, and its capacity to shape the development of AI technologies, continues.

The use restrictions are included as an attachment to the license, which lists thirteen such restrictions. Their range is very broad, and overlaps with issues raised previously not just in ethical guidelines, but also in regulatory debates, such as those on the European AI Act. The license restricts, for example, uses that violate laws and regulations, uses that exploit or harm minors, or uses that discriminate or harm “individuals or groups based on social behavior or known or predicted personal or personality characteristics”. There is also a requirement to disclose that text created with BLOOM is machine-generated, and a restriction on any “medical advice and medical results interpretation”. The latter differs from the other restrictions because its broad scope also covers uses that are not unethical.

RAIL license and the landscape of open licensing

The RAIL license was developed by AI researchers, in the context of debates on resource sharing, but also governance and responsibility, taking place in their community. It is interesting that open sharing is taken as given, and the tools to do so receive little attention. It is the ethics of the use of AI technologies, and enforcement of responsible uses, that is the new design requirement and the focus of the license design process. 

At the same time, it is surprising that the compatibility of the two goals, open sharing and enforcement of responsible use, was not investigated more critically. This balance is not necessarily easy to strike. The fact that these issues were not investigated might be due to the character of the initiative, which seems to be strongly rooted in the AI governance debate, with no visible connections to open licensing frameworks and their stewards. These are all issues that should be considered by the broader community of open advocates.

Finding ways to achieve this balance has important consequences for the open licensing framework. Traditionally, designers and stewards of open licenses have been averse to introducing limitations to open licensing. And a strong current in the Open Movement argued in favor of using only the most permissive licenses. This argument was recently made in relation to licensing of face recognition datasets by John Weitzmann, legal counsel of Wikimedia Germany. He argues that use restrictions have overly broad limiting effects on use, and the same goals of fostering responsible use can be achieved with other tools than copyright licenses. 

The key issues raised by the new license concern license proliferation and governance, open licensing and peer production, the theory of change behind open licensing (vis-à-vis other market players and their closed forms of managing information resources), and most importantly how open licensing frameworks themselves are understood.

License proliferation

The addition of licensing criteria, including use limitations, has been criticized in the past as leading to license proliferation, which becomes especially problematic when there is an expectation that openly shared resources should be available for free reuse and recombination. This argument is more relevant with regard to RAIL and future, similar licenses than the more common critique: that such restrictions are unnecessary limitations on user freedoms, which are at the heart of the free/open philosophy. Already, the RAIL license has been modified for the release of the Stable Diffusion model: the restriction on fully automated decision making, and on use without disclaimers that content is machine-generated, were removed.

Authors of the RAIL license envision a developer-driven ecosystem that, through the use of licenses, can “help democratize the definition of responsible use by enabling AI developers to incorporate permissive- and restrictive-use clauses, based on their view of how easy it may be to repurpose their system”. They propose tools that would enable a variety of licenses to be managed, including a license repository that could “help potential licensors select existing predefined licenses that align with their ethical principles” or modular license generators that would let developers “select clauses (or ethical principles) as well as other license elements, such as the terms of commercial/non-commercial distribution, description of penalties or conditions of violation etc, that they would like to apply with the release of their AI systems”.
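To make the idea of a modular license generator more concrete, here is a minimal sketch in Python. It is purely illustrative and not based on any tool published by the RAIL authors: the clause identifiers, the clause wording, and the names `CLAUSES`, `DISTRIBUTION_TERMS`, `LicenseSpec` and `generate_license` are all hypothetical placeholders, and the texts are not taken from the RAIL license.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical clause catalogue: identifiers and wording are illustrative
# placeholders, not text from the RAIL license or any other real license.
CLAUSES = {
    "no-unlawful-use": "You may not use the model in violation of applicable law.",
    "no-harm-to-minors": "You may not use the model to exploit or harm minors.",
    "disclose-machine-generated": "You must disclose that output is machine-generated.",
}

# Hypothetical distribution options (the "other license elements").
DISTRIBUTION_TERMS = {
    "commercial": "Commercial and non-commercial distribution is permitted.",
    "non-commercial": "Only non-commercial distribution is permitted.",
}


@dataclass
class LicenseSpec:
    """A developer's selection of use restrictions and distribution terms."""
    clause_ids: List[str] = field(default_factory=list)
    distribution: str = "commercial"


def generate_license(spec: LicenseSpec) -> str:
    """Assemble license text from the selected modules, in the order given."""
    unknown = [c for c in spec.clause_ids if c not in CLAUSES]
    if unknown:
        raise ValueError(f"unknown clause ids: {unknown}")
    if spec.distribution not in DISTRIBUTION_TERMS:
        raise ValueError(f"unknown distribution terms: {spec.distribution}")
    lines = [
        "Distribution: " + DISTRIBUTION_TERMS[spec.distribution],
        "Use restrictions:",
    ]
    for number, clause_id in enumerate(spec.clause_ids, start=1):
        lines.append(f"  {number}. {CLAUSES[clause_id]}")
    return "\n".join(lines)


if __name__ == "__main__":
    spec = LicenseSpec(["no-harm-to-minors", "disclose-machine-generated"], "non-commercial")
    print(generate_license(spec))
```

Even this toy catalogue of three clauses and two distribution options allows 2^3 × 2 = 16 distinct licenses (ignoring clause order), which hints at how quickly a modular approach can multiply license variants.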

As a reference, they point to the multitude of open source licenses that have been developed over time, but do not address concerns that are usually raised in this context. Admittedly, the example of the “free market” of open licenses suggests that a similar approach could work for the new wave of open and responsible licenses, with standard licenses emerging in an evolutionary manner. In order to fully understand the challenges caused by such proliferation, it is necessary to closely examine the reuse and remixing practices in the field of AI.

License governance

Creators of RAIL envision a decentralized system, in which AI developers are free to implement licenses with various restrictions, based on their ethical preferences. At the same time, use restrictions included in the RAIL license are the outcome of the collective work of the BigScience community. 

This leads to questions concerning license governance. In the case of the RAIL license, no details have been provided regarding the process of selecting the restrictions, and no governance structure for the license has been established. As I noted before, some of the restrictions could be contested as unnecessary limitations, such as the full restriction on medical uses. And similarly, one could argue that further restrictions could be introduced for sensitive uses of AI, such as a broader restriction for educational uses.

This stands in contrast to the strong participatory ethos at the heart of collaborative work on the BLOOM model itself. (It is also possible that license development has been participatory, only not properly documented.)

This suggests that further development of open licenses like the RAIL license would benefit from stronger community governance. In the open licensing space, there is a prior example of the Open Definition Advisory Council, a democratic, grassroots body that stewarded the definition. The body has been dormant for a long time and should possibly be reinstated. Its first major goal would be to revisit the definition in light of emerging new trends in open licensing, signaled by the RAIL license or the new family of Can’t Be Evil licenses.

Open licensing and peer production

The BLOOM model is not just openly licensed, but also the result of a collaborative model that fits what Yochai Benkler called commons-based peer production. It is a model that is at the heart of many open initiatives, including Wikipedia or the Firefox browser. At the same time, over the last two decades, it became clear that there is no causal connection between open sharing and peer production: most open resources are most probably created in non-collaborative settings (there is, unfortunately, no empirical research on this available, and this probably varies between fields of open).

This is one more reason why the peer-produced BLOOM model is so important in defining the standard for open AI – and why it opens a much-needed conversation about open sharing. Jennifer Ding from the Alan Turing Institute argues that the collaborative production and community governance of the model are as important as open sharing:

“BLOOM opens the black box not just of the model itself, but also of how LLMs are created and who can be part of the process. With its publicly documented progress, and its open invitation to any interested participants and users, the BigScience team has distributed the power to shape, criticize and run an LLM to communities outside big tech.”

Melissa Heikkilä, writing for the MIT Technology Review, believes that this approach enabled BLOOM creators to include so many languages in the model: the community made an effort to diversify, and then crowdsource data on less popular languages.

The importance of participatory governance of the dataset parallels the argument made above about license governance. There is a general trend to treat participatory approaches – traditionally seen as an issue beyond the scope of open licensing concerns – as crucial to the success of open initiatives. This is well visible in the space of data governance, where it has been proposed by Salome Viljoen and is being championed by organizations like Connected by Data.

The theory of change for open licensing 

While the RAIL license is presented as an open license, it ultimately aims foremost to limit unethical uses (under the conditions of an openly shared resource). This distinction is crucial for understanding the potential impact of the public release of BLOOM.

In the past, one of the aims of sharing openly was to create viable alternatives in markets dominated by incumbents using closed, exclusive models of managing intellectual property. This was the case with Wikipedia, the Linux operating system, the Apache server, the WordPress CMS or the Firefox browser. In each case, the open alternative managed to gain at least a significant minority position, and in some cases became the dominant standard. The theory of action behind these interventions was that over time freely available resources can become dominant, as long as they are of sufficient quality.

The situation is different with the BLOOM model, with its broad range of use restrictions aimed at curbing harmful uses. Simply speaking, multiplying goods requires different systemic approaches than reducing harm. Authors of the RAIL license acknowledge challenges with license enforcement, but argue that it is still an improvement over the effectiveness of community norms or regulation. At the same time, there are already examples of licensing conditions being broken (together with default safety guidelines at the code level).

Even if license enforcement is successful, those wanting to develop solutions for these restricted uses will simply have to use other, closed language models for their projects. If the aim is to curb unethical and harmful uses, then this will be achieved not through the proliferation of openly shared models, but through the proliferation of licenses like RAIL among developers of other models – so that use restriction becomes a standard. 

The future of open licensing frameworks

The authors of the RAIL license acknowledge that the license does not meet the Open Source Initiative definition of open code licenses (and it does not meet the Open Definition either). In related news, the newly launched Can’t Be Evil licenses also challenged established open licensing models, while seeking to uphold the spirit of open sharing. 

There are previous examples of licenses that are part of accepted open licensing frameworks and that have included use restrictions. In particular, the family of Creative Commons licenses and tools – which are today the dominant content licensing framework – includes licenses that restrict commercial uses. Admittedly, these licenses have over the years been the subject of heated debates in the Open Movement, as it aimed to set a licensing standard to which the community would adhere.

The RAIL license, with its broader and more detailed scope of restrictions, can raise the same questions. Yet the question of whether the RAIL license is an open license according to the definitions, while possibly interesting for open licensing stalwarts, is not the most important one. The more important question is whether this license, if we agree that it correctly identifies challenges that open licensing faces today, spells a need to revisit open licensing frameworks. Traditionally, debates over what constitutes an open license were related to normative debates about ensuring user freedoms. Authors of the RAIL license rightly point out that these need to be balanced today with care for responsible uses. 

The introduction of use-based restrictions of the scale included in the RAIL license should be seen as a significant change in the approach to open licensing. It is a sign of the times that the crafters of the license saw the need to take into account not just the positive, democratizing effect of sharing, but also possible negative consequences – in particular harms and power imbalances. The issue should be considered beyond the emerging “open AI” community. The key questions concern the need for such new, restricted licenses, which should be balanced with a clear understanding of situations where the “traditional” open licenses are still fit for purpose.

A review of open licensing frameworks should also take into account other current developments. For example, the recently launched Can’t Be Evil licenses, meant to be used for licensing copyrights related to NFTs, also signal a need to revisit open licensing – and necessitate a response from the Open Movement.



Comments
Tim Davies:

Really interesting exploration.

One key question it raised for me is: how are models different from datasets, in terms of how they may be embedded or combined with other systems, and in terms of the political dynamics of opening up models vs. data?

I see two relatively strong cases that get made for the ‘unrestricted open license’ advocacy around open data:

(1) The value of data comes from connecting datasets. This makes license incompatibility particularly problematic when combining more than 2 datasets. Because of the mode of getting value from data, license proliferation introduces high overheads and costs.

I’m not sure the same dynamics occur with ML Models in most cases.

(2) Data holders are liable to ‘abuse’ license restrictions to further private interests. Particularly in the case of opening government data, restrictions on the kind of re-use allowed can limit political freedoms etc.

It seems license restrictions on ML models could fall into this trap (allowing the model owner full power to use the model, but only licensing some of this power to others: e.g. FB restrictive model that allows them to commercialise, but prohibits others from doing this), but they may also be aiming to create a level playing field in which no-one is abusing the power of the model.

I’m not sure rebooting the Open Definition Advisory Council is entirely the way forward (or at least, it would need a substantially broader base to deal with the issues at hand to ‘define’ openness in the context of AI), but I am struck by the implicit point above that the norms of licenses are not enacted just by being written down, but need conversation and community around them in order to be understood, and to have some binding power on those they seek to affect.

Alek Tarkowski:

Thanks for all this feedback.

Regarding differences between datasets and models, this is a key issue. Personally, there’s some level of detail at which I lack the technical expertise to understand how models are “used”. Indeed, do they get remixed, are there the same affordances that open models help unlock? I hope to learn more about this with the help of AI experts.

On the last point, I’m not insisting on rebooting that particular group, although I like the idea of building on what has previously been done. This could be a creative reboot, and you’re right, my main point is that that was an effort at having community participation in framing what are open frameworks, and where are their boundaries.

Tim Davies:

Potentially worth recognising that this has just been one strand of license work over the last decade.

For example, indigenous communities in particular have worked to develop Traditional Knowledge licenses (see https://creativecommons.org/2018/09/18/traditional-knowledge-and-the-commons-the-open-movement-listening-and-learning/ etc.) that do introduce limitations.

Arguably the focus on a very strict open definition binary has been a relatively anglo or euro-centric concern.

Alek Tarkowski:

I agree. And in turn, at an earlier stage, this binary distinction was a way to enforce a strong vision of “Free culture”, similar to “free software” (vs open source). There are, for example, educational networks where the standard of “openness” includes non-commercial limitations.
So you are right that this requires some nuancing. Though I do think that it represents a viewpoint that’s at least prevalent.

Javier Ruiz:

Excellent discussion. I think that one big area to unpack is how these AI digital systems differ from traditional software (see https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses) and how far downstream do you need to go to achieve the benefits of “open”, e.g. training data and documentation.

Also, the Model licence mentions contributors, and as you raise in peer collaboration this is another major topic with the right agreements being critical to the long term viability of many projects.

Alek Tarkowski:

Thanks Javier! Indeed, it seems that this space can be mapped out in more detail. Maybe you’re up for an online workshop where we’d try to map this territory?