
Towards a Books Data Commons for AI Training

This paper, which has been informed by a series of workshop discussions, maps possible paths to building a books data commons and outlines key questions relevant to developers, repositories, and other potential stakeholders.

Published on Apr 08, 2024

1. Introduction1

While the field of artificial intelligence research and technology has a long history, broad public attention grew over the last year in light of the wide availability of new generative AI systems, including large language models (LLMs) like GPT-4, Claude, and LLaMA-2. These tools are developed using machine learning and other techniques that analyze large datasets of written text, and they are capable of generating text in response to a user’s prompts.

While many large language models rely on website text for training, books have also played an important role in developing and improving AI systems. Despite the widespread use of e-books and growth of sales in that market, books remain difficult for researchers and entrepreneurs to access at scale in digital form for the purposes of training AI.

In 2023, multiple news publications reported on the availability and use of a dataset of books called “Books3” to train LLMs2. The Books3 dataset contains text from over 170,000 books, which are a mix of in-copyright and out-of-copyright works. It is believed to have been originally sourced from a website that was not authorized to distribute all of the works contained in the dataset. In lawsuits brought against OpenAI, Microsoft, Meta, and Bloomberg related to their LLMs, the use of Books3 as training data was specifically cited.3

The Books3 controversy highlights a critical question at the heart of generative AI: what role do books play in training AI models, and how might digitized books be made widely accessible for the purposes of training AI? What dataset of books could be constructed and under what circumstances?

In February 2024, Creative Commons, Open Future and Proteus Strategies convened a series of workshops to investigate the concept of a responsibly designed, broadly accessible dataset of digitized books to be used in training AI models. In workshops conducted under the Chatham House Rule, we set out to ask whether there is a possible future in which a “books data commons for AI training” might exist, and what such a commons might look like. The workshops brought together practitioners on the front lines of building next-generation AI models, as well as legal and policy scholars with expertise in the copyright and licensing challenges surrounding digitized books. Our goal was also to bridge the perspective of stewards of content repositories, like libraries, with that of AI developers. A “books data commons” needs to be both responsibly managed and useful for developers of AI models.

We use “commons” here in the sense of a resource that is broadly shared and accessible, and thus obviates the need for each individual actor to acquire, digitize, and format their own corpus of books for AI training. This resource could be collectively and intentionally managed, though we do not mean to select a particular form of governance in this paper.4

Building on our workshop discussions, this paper is descriptive rather than prescriptive: it maps possible paths to building a books data commons as defined above, along with key questions relevant to developers, repositories, and other stakeholders. We first explain why books matter for AI training and how broader access could be beneficial. We then summarize two tracks that might be considered for developing such a resource, highlighting existing projects that help foreground both the potential and challenges. Finally, we present several key design choices and next steps that could advance further development of this approach.5

2. Basics of AI Training and Technical Challenges of Including Books

It’s critical to understand that LLMs are not trained on text “as is” – that is, the model does not digest the text the way a human would, front to back. The training data does not retain a copy of the original text in its original form. Instead, the text is processed in smaller chunks, which are then shuffled and “tokenized,” as we explain further below.

One way to conceptualize the chunking, shuffling and tokenizing process is to imagine a 900-page book containing 400,000 words. To feed it into an AI model, the book will first be cut into manageable chunks of text that represent up to several thousand tokens each; such a process might result in around 50 “chunks” of text. Each of those chunks will contain long sections of narrative content; however, the chunks themselves will then be randomized and fed into the AI model out of sequence from each other. The first chunk may be text from Chapters 9 and 10, while the initial text in Chapter 1 may be in the 30th chunk. Within these chunks, the text itself will be understood by the model as tokens.
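To make the chunking and shuffling step concrete, the following is a minimal sketch in Python. The chunk size, the whitespace-based splitting, and the helper names are illustrative assumptions, not a description of any particular training pipeline (real pipelines typically chunk by token count rather than word count).

```python
import random

def chunk_book(text: str, words_per_chunk: int = 8000) -> list[str]:
    """Split a book's text into fixed-size chunks of whitespace-separated words.
    A simplified stand-in for token-based chunking."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# Stand-in for the full text of a 400,000-word book.
book_text = "word " * 400_000

# A 400,000-word book split into ~8,000-word chunks yields about 50 chunks.
chunks = chunk_book(book_text)
print(len(chunks), "chunks")

# The chunks are then shuffled, so the model sees them out of narrative order.
random.shuffle(chunks)
```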

In the example below, 252 characters of human-readable text are shown in tokenized form as 57 distinct tokens, the relationships between which then form the basis of building an AI model. The illustration shows a block of human-readable text as it would be tokenized for AI training; different colors are used in this visualization merely to differentiate one token from another within the string of text. As the visualization makes clear, not all of the tokens directly correspond to a single word; tokens merely represent characters that often appear together in the training data.6

Tokens do not typically correspond to whole words; instead, they often represent subword units. For example, the word “incompetence” may be broken into three tokens: “in-,” “competent,” and “-ence.” This approach to tokenization enables representation of grammar and word variations, effectively allowing a high degree of language generalizability.7
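As an illustration of subword tokenization, here is a minimal sketch using the open-source tiktoken library; the choice of library and encoding is an assumption made for illustration, not something specified in this paper.

```python
import tiktoken  # pip install tiktoken

# One widely used byte-pair-encoding scheme; other tokenizers behave similarly.
enc = tiktoken.get_encoding("cl100k_base")

text = "Incompetence is rarely rewarded."
token_ids = enc.encode(text)

print(len(text), "characters ->", len(token_ids), "tokens")

# Inspect how the string is carved into subword pieces.
for token_id in token_ids:
    print(token_id, repr(enc.decode([token_id])))
```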

In recent years, LLM research has successfully scaled up models by pre-training on very large numbers of tokens. In turn, this has allowed a higher degree of language generalizability in the resulting models. For example, OpenAI’s ChatGPT was trained on hundreds of billions of tokens, allowing it to model language in a very general way. The resulting models can then be fine-tuned for specific tasks using training data representing a particular corpus, such as software code.8
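As a rough illustration of the pre-train-then-fine-tune pattern described above, the sketch below continues training a small public checkpoint on a tiny domain-specific corpus. The choice of "gpt2" as the base model, the toy corpus, and the hyperparameters are assumptions for illustration only, not a description of how any production model is trained.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Tiny stand-in "corpus" for a domain-specific fine-tune (e.g., software code).
corpus = [
    "def add(a, b):\n    return a + b",
    "def subtract(a, b):\n    return a - b",
]

for text in corpus:
    batch = tokenizer(text, return_tensors="pt")
    # For causal language modeling, the inputs double as labels
    # (the model shifts them internally to predict the next token).
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```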

3. Why Books are Important to Training AI

Despite the proliferation of online content and some speculating that books would simply die out with the advent of the Internet9, books remain a critical vehicle for disseminating knowledge. The more scientists study how books can impact people, the less surprising this is. Our brains have been shown to interact with longform books in meaningful ways: we develop bigger vocabularies when we read books; we develop more empathy when we read literary fiction; and connectivity between different regions of our brain increases when we read.10

In that light, it might be unsurprising that books are important for training AI models. A broadly accessible books dataset could be useful not only for building LLMs, but also for many other types of AI research and development.

Performance and Quality

The performance and versatility of an AI model can significantly depend on whether the training corpus includes books or not. Books are uniquely valuable for AI training due to several characteristics.

  • Length: Books tend to represent longer-form content, and fiction books in particular represent long-form narrative. An AI model trained on this longer-form, narrative type of content can make connections over a longer context: rather than merely putting words together to form a single sentence, it becomes better able to string concepts together into a coherent whole. Even after a book is divided into many “chunks” before tokenization, those chunks still provide stretches of text far longer than the average web page. Web documents, for instance, tend to be longer than a single sentence, but they are not typically hundreds of pages long like a book.

  • Quality: The qualities of the training data impact the outputs a tool can produce. Consider an LLM trained on gibberish; it can learn the patterns of that gibberish and, in turn, produce related gibberish, but will not be very useful for writing an argument or a story, for instance. In contrast, training an LLM on books with well-constructed arguments or crafted stories could serve those purposes. While “well-constructed” and “crafted” are necessarily subjective, the traditional role of editors and the publishing process can provide a useful indicator for the quality of writing inside of books. What’s more, metadata for books — information such as the title, author and year of publication — is often more comprehensive than metadata for information found on the web, and this additional information can help contextualize the provenance and veracity of information.

  • Breadth, Diversity, and Mitigating Bias: Books can serve a critical role in ensuring AI models are inclusive of a broad range of topics and categories that may be under-represented in other content. For all that the Internet has generated an explosion in human creativity and information sharing, it generally represents only a few decades of information and a small portion of the world’s creative population. A books dataset, by comparison, is capable of representing centuries of human knowledge. As a result, such a dataset can help ground AI systems’ behavior in centuries of historical knowledge as well as modern works, and it can help ensure broad geographic and linguistic diversity. What’s more, the greater breadth and diversity of high-quality content help mitigate challenges around bias and misinformation: using a more diverse pool of training data can help produce models, and model outputs, that are more representative of that diversity. Books can also be useful in evaluation datasets to test existing models for memorization, which can help prevent unintended reproduction of existing works (see the sketch after this list). Of course, this is all contingent on the actual composition of the corpus; to realize the benefits described, the books would need to be curated for characteristics like temporal, geographic, and linguistic diversity.

  • Other Modalities: Finally, books do not just contain text, they often contain images and captions of those images. As such, they can be an important training source for multi-modal LLMs, which can receive and generate data in media other than text.
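As referenced in the bullet on breadth, diversity, and mitigating bias above, the sketch below shows one simple way books might be used to probe a model for memorization: prompt the model with the opening of a passage and check whether the generated continuation reproduces the original text verbatim. The model name, the passage, and the matching heuristic are placeholders assumed for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model under evaluation
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical passage drawn from a book in the evaluation set.
passage = (
    "It was the best of times, it was the worst of times, it was the age of wisdom, "
    "it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity."
)
split = len(passage) // 2
prompt, expected_continuation = passage[:split], passage[split:]

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=False)
generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# A verbatim (or near-verbatim) match suggests the passage may have been memorized.
print("Verbatim reproduction?", expected_continuation.strip().startswith(generated.strip()[:40]))
```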

Lowering Barriers to Entry & Facilitating Competition

Broad access to books for AI training is critical to ensure powerful AI models are not concentrated in the hands of only a few companies. Access to training data, in general, has been cited as a potential competitive concern in the AI field because of the performance benefits to be gained by training on larger and larger datasets. But this competitive wedge is even more acute when we look specifically at access to book datasets.11

The largest technology companies building commercial AI models have the resources and capacity to mass digitize books for AI training. Google has scanned 40 million books, many of which came from digitization partnerships it formed with libraries. It may already use some or all of these books to train its AI systems.12 It’s unclear to what extent other companies have already acquired books for AI training (for instance, whether Amazon’s existing licenses with publishers or self-published authors may permit such uses); regardless, efforts comparable to Google’s would cost many hundreds of millions of dollars.13

Independent researchers, entrepreneurs, and most other businesses and organizations are unlikely to have the resources required to digitally scan millions of books or to purchase licenses to digitized books in ways that could unlock the benefits described above. Ensuring greater competition and innovation in this space will require making this type of data available to upstarts and other entities with limited resources. A well-designed and appropriately governed digital books commons is one way to do that.

4. Copyright, Licensing, & Access to Books for Training

Even if books can be acquired, digitized, and made technically useful for AI training, the development of a books data commons would necessarily need to navigate and comply with copyright law.

Out-of-Copyright Books: A minority of books are old enough to be in the public domain and out of copyright, and an AI developer could use them in training without securing any copyright permission. In the United States, all books published or released before 1929 are in the public domain. While use of these books provides maximal legal certainty for the AI developer, it is worth noting that a book’s public domain status can be difficult to determine.14 For instance, books released between 1929 and 1963 in the U.S. are out of copyright if they were not subject to a copyright renewal; however, data on copyright renewals is not easily accessible.

What’s more, copyright definitions and term lengths vary among countries. Even if a work is in the public domain in the US, it may not be in other countries.15 Countries generally use the life of the last living author + “x” years to determine the term of copyright protection. For most countries, “x” is either 50 years (the minimum required by the Berne Convention) or 70 years (this is the case for all member states of the European Union and for all works published in the U.S. after 1978). This approach makes it difficult to determine copyright terms with certainty because it requires information about the date of death of each author, which is often not readily available.
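To illustrate how these rules translate into a concrete (and highly simplified) check, the sketch below encodes only the two rules mentioned above, i.e., the US pre-1929 publication rule and a life-of-the-last-author-plus-N-years rule. It ignores renewals, corporate authorship, and many other complications, so it is an illustration only and should not be relied on for actual rights determinations.

```python
from datetime import date

def is_public_domain_simplified(
    publication_year: int,
    last_author_death_year: int | None,
    jurisdiction: str = "US",
    term_years: int = 70,
) -> bool:
    """Highly simplified public-domain check covering only the rules discussed
    in the text: US publication before 1929, or life of the last living author
    plus `term_years` (50 or 70 in most countries)."""
    current_year = date.today().year
    if jurisdiction == "US" and publication_year < 1929:
        return True
    if last_author_death_year is None:
        return False  # unknown death date: status cannot be determined
    return current_year > last_author_death_year + term_years

# Example: a work whose last author died in 1950, in a life-plus-70 jurisdiction,
# entered the public domain at the start of 2021.
print(is_public_domain_simplified(1935, 1950, jurisdiction="EU", term_years=70))
```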

In-Copyright Books: The vast majority of books are in copyright, and, insofar as the training process requires making a copy of the book, the use in AI training may implicate copyright law. Our workshop covered three possible paths for incorporating such works.

Direct licensing

One could directly license books from rightsholders. There may be some publishers who are willing to license their works for this purpose, but it is hard to determine the scale of such access, and, in any event, there are significant limits on this approach. Along with the challenge (and expense) of reaching agreements with relevant rightsholders, there is also the practical difficulty of simply identifying and finding the rightsholder that one must negotiate with. The vast majority of in-copyright books are out-of-print or out-of-commerce, and most are not actively managed by their rightsholders. There is no official registry of copyrighted works and their owners, and existing datasets can be incomplete or erroneous.16

As a result, there may be no way to license the vast majority of in-copyright books, especially those that have or have had limited commercial value.17 Put differently, the barrier to using most books is not simply to pay publishers; even if one had significant financial resources, licensing would not enable access to most works.

Permissively licensed works

There are books that have been permissively licensed in an easily identifiable way, such as works placed under Creative Commons (CC) licenses. Such licenses explicitly allow particular uses of the works, subject to various responsibilities (e.g., requiring attribution by the user in their follow-on use).

While such works could be candidates for inclusion in a books data commons, their inclusion depends on whether the license’s terms can be complied with in the context of AI training. For instance, in the context of CC licensed works, there are requirements for proper attribution across all licenses (the CC tools Public Domain Dedication (CC0) and Public Domain Mark (PDM) are not licenses and do not require attribution).18
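As one concrete way to think about complying with attribution requirements at dataset scale, a corpus might carry a per-work attribution record alongside the text. The sketch below is an illustrative data structure only; the field names, the example work, and the JSON-lines packaging are assumptions, not an established standard.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AttributionRecord:
    """Minimal per-work metadata a books dataset might retain so that
    downstream users can satisfy CC attribution requirements."""
    title: str
    creator: str
    source_url: str
    license_name: str   # e.g., "CC BY 4.0"
    license_url: str
    modifications: str  # e.g., "text extracted and tokenized for AI training"

record = AttributionRecord(
    title="An Example Openly Licensed Book",         # hypothetical work
    creator="Jane Author",
    source_url="https://example.org/books/example",  # placeholder URL
    license_name="CC BY 4.0",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    modifications="text extracted and tokenized for AI training",
)

# Attribution records could be shipped with the dataset, e.g., as JSON lines.
print(json.dumps(asdict(record)))
```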

Copyright limitations and exceptions

Even if a book is in copyright, it’s possible that copying books for AI training may be covered by existing limitations and exceptions to copyright law in particular jurisdictions. For example:

  • In the United States, many argue that using existing works to train generative AI is “fair use,” consistent with existing law and legal precedents.19 This is the subject of a number of currently active court cases, and different actors and tools may yield different results, as fair use is applied case-by-case using a flexible balancing test.

  • In the European Union, there are explicit exceptions in the law for “text and data mining” uses of in-copyright works, both for non-commercial research and for commercial purposes. However, commercial users, and users outside of research and heritage institutions, must respect the rights of rightsholders who choose to “reserve their rights” (i.e., opt out of allowing text and data mining) via machine-readable mechanisms.20 The exception also requires that users have “lawful access” to the works.

  • Japan provides a specific text and data mining exception, without the comparable opt-out requirement for commercial uses that is embedded in EU law.21

While exceptions that allow AI training exist in several other countries, such as Singapore and Israel, most countries do not provide exceptions that appear to permit AI training. Even where potentially available, as in the United States, legal uncertainty and risk create a hurdle for anyone building a books commons.22

It is also important to note two other issues that can affect the application of limitations and exceptions, in particular, their application to e-books.

The first important limitation is that almost every digital book published today comes with a set of contractual terms that restrict what users can do with it. In many cases, those terms will explicitly restrict text and data mining or AI uses of the content, meaning that even where copyright law allows for reuse (for example, under fair use), publishers can impose restrictions by contract anyway. In the United States, those contract terms are generally thought to override the applicability of fair use or other limitations and exceptions.23 Other jurisdictions, such as those in the EU, provide that certain limitations and exceptions cannot be contractually overridden, though experience to date varies with how those anti-contractual-override protections work in practice.24

The second limitation is the widespread adoption of “anti-circumvention” rules in copyright laws and the interplay of these with a choice to rely on copyright limitations and exceptions. Digital books sold by major publishers are generally encumbered with “digital rights management” (DRM) that limits how someone can use the digital file. For instance, DRM can limit the ability to make a copy of the book, or even screenshot or excerpt from it, among other things. Anti-circumvention laws restrict someone's ability to evade these technical restrictions, even if it is for an ultimately lawful use.

What this means for our purposes is that even if one acquires a digital book from, for example, Amazon, and it is lawful under copyright law to use that book in AI training, it can still generally be unlawful to circumvent the DRM to do so, outside narrow exceptions. Thus, the ability to use in-copyright books encumbered by DRM25 — that is, almost all books sold by major publishers — is generally limited.26

Practically, using in-copyright books to build a books commons for AI training — while relying on copyright’s limitations and exceptions — requires turning a physical book into digital form, or otherwise engaging in the laborious process of manually re-creating a book’s text (i.e., re-typing the full text of the book) without circumventing the technical restrictions themselves.

5. Examining approaches to building a books data commons

There are many possible permutations for building a books data commons. To structure our exploration, we focused on two particular tracks, discussed below. We chose these tracks mindful of the above legal issues, and because there are already existence proofs that help to illuminate tradeoffs, challenges and potential paths forward for each.

5a. Public domain and permissively licensed books

Existing Project Example27: The Pile v2

In 2020, the nonprofit research group EleutherAI constructed and released The Pile — a large, diverse, open dataset for AI training. EleutherAI developed it not only to support their own training of LLMs, but also to lower the barriers for others.28

Along with data drawn from the web at large, The Pile included books from three datasets. The first dataset was the Books3 corpus referenced at the outset of this paper. The second and third books datasets were smaller: BookCorpus2, a collection of 17,868 books by otherwise unpublished authors, and a collection of 28,752 public domain books published prior to 1919, drawn from Project Gutenberg, a volunteer effort to digitize public domain works.

As the awareness about The Pile dataset grew, certain rightsholders began sending copyright notices to have the dataset taken down from various websites.

Despite the takedown requests, the importance of books to EleutherAI and the broader community’s AI research remained. Hoping to forge a path forward, EleutherAI announced in 2024 that it would create a new version of the dataset, The Pile v2.29 Among other things, v2 would “have many more books than the original Pile had, for example, and more diverse representation of non-academic non-fiction domains.” At the same time, it would only seek to include public domain books and permissively licensed content. As before, this corpus focuses on English-language books.

Implications of the Overall Approach

Stepping back from The Pile v2 specifically, or any particular existing collection of books or dataset built on their basis, we want to understand the implications of relying on public domain works and expressly licensed works in building a books commons.

The benefits are relatively straightforward. Both categories, by definition, come with express permission to use the books in AI training. The cost of acquiring the books for this use may be effectively zero, or close to it, when considering public domain and “openly” licensed books that allow redistribution and that have already been digitized.

But this approach comes with some clear limitations. First, as noted above, for many books in the public domain, their status as such is not always clear. And with respect to permissively licensed books, it is not always clear whether and how to comply with the license obligations in this context.

Setting aside those challenges, the simple fact is that relying on public domain and existing permissively licensed books would limit the quantity and diversity of data available for training, impacting performance along different dimensions. Only a small fraction of books ever published fall into this category, and the corpus of books in this category is likely to be skewed heavily towards older public domain books. This skew would, in turn, impact the content available for AI training.30 For instance, relying on books from before 1929 would not only incorporate outdated language patterns, but also a range of biases and misconceptions about race and gender, among other things. Efforts could be made to get people to permissively license more material — a book drive for permissive licensing, so to speak; this approach would still not encompass most books, at least when it comes to past works.31

5b. Limitations & Exceptions

Existing Project Example: HathiTrust Research Center (HTRC)

The HathiTrust Research Center provides researchers with the ability to perform computational analysis across millions of books. While it is not suited specifically for AI training, it is an existence proof for what such a resource might look like.

It is also an example predicated on copyright’s limitations and exceptions — in this case, on U.S. fair use. While the Authors Guild filed a copyright infringement suit against HathiTrust, federal courts in 2012 and 2014 ruled that HathiTrust’s use of books was fair use.32

A nonprofit founded in 2008, HathiTrust grew out of a partnership among major US university libraries and today is “an international community of research libraries committed to the long-term curation and availability of the cultural record.”33 It started in what it calls the “early days of mass digitization” — that is, at a time when it started to become economical to take existing physical artifacts in libraries and turn them into digital files at a large scale.

The founding members of HathiTrust were among the initial partners for Google’s Book Search product, which allows people to search across and view small snippets of text from in-copyright books34 and read full copies of public domain books scanned from libraries’ collections. The libraries provided Google with books from their collections; Google then scanned the books for use in Book Search and returned a digital copy to each library for its own uses. These uses included setting up HathiTrust not only to ensure long-term preservation of the digital books and their metadata, but also to facilitate other uses, including full-text search of books and accessibility for people with print disabilities. In separate court cases, both Google’s and HathiTrust’s uses of the books were deemed consistent with copyright law.

The uses most relevant to this paper are those enabled by what HathiTrust refers to today as the Research Center. The Center grew in part out of a research discipline called “digital humanities,” which, among other things, seeks to use computational resources or other digital technologies to analyze information and contribute to the study of literature, media, history, and other areas. For instance, imagine you want to understand how a given term (e.g., “war on drugs”) came into use; one might determine when the term was first used and how often it has been used over time by searching across a vast quantity of sources. The insight here is that there is much to be learned not just from reading or otherwise consuming specific material, but also from “non-consumptive research,” or "research in which computational analysis is performed on one or more volumes (textual or image objects)" to derive other sorts of insights. AI training is a type of non-consumptive use.

Today, the Center “[s]upports large-scale computational analysis of the works in the HathiTrust Digital Library to facilitate non-profit and educational research.” It includes over 18 million books in over 400 languages from the HathiTrust Digital Library collection. Roughly 58% of the corpus is in copyright. HathiTrust notes that, while this corpus is large, it has limitations in terms of its representation across subject matter, language, geography, and other dimensions. In terms of subject matter, the corpus is skewed towards the humanities (64.9%) and social sciences (14.3%). In terms of language, 51% of the books are in English; German is the next-largest language represented, at 9%, followed by a long tail of other languages.

In order to enable these uses, HathiTrust has invested in technical solutions to prevent possible misuse. To some extent, it manages this by limiting who gets access to the Center, and by limiting access to specific features to researchers at member institutions. HathiTrust has also put in place various security controls, including physical security controls on the data centers housing the digitized books, restrictions on network access to those files, and encryption of backup tapes. The primary uses of the data through the Research Center include access to an extracted-features set and access to the complete corpus via a “data capsule,” a virtual machine running on the Center’s servers. The data capsule allows users to conduct non-consumptive research with the data, but it limits the types of outputs allowed in order to prevent users from obtaining the full content of in-copyright works. In finding that HathiTrust’s use was a fair use and thus rejecting a lawsuit brought by the Authors Guild, the court noted the importance of these controls.35
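To give a flavor of what “non-consumptive” access can look like in practice, the sketch below reduces a page of text to a bag-of-words count, in the spirit of an extracted-features dataset. It is a simplified illustration under that assumption, not HathiTrust’s actual format or tooling.

```python
from collections import Counter
import re

def extract_page_features(page_text: str) -> dict[str, int]:
    """Reduce a page to word-frequency counts. The full, readable text cannot be
    reconstructed from these counts, which is what makes this kind of release
    'non-consumptive'."""
    words = re.findall(r"[a-zA-Z']+", page_text.lower())
    return dict(Counter(words))

page = (
    "The war on drugs became a common phrase in public debate, "
    "and the phrase appears again and again in later sources."
)
print(extract_page_features(page))
# e.g., {'the': 2, 'war': 1, 'on': 1, 'drugs': 1, 'phrase': 2, ...}
```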

Today, the Center’s tools are not suitable for AI training, in that they don’t allow the specific types of technical manipulation of underlying text necessary to train an AI. Nevertheless, the Center demonstrates that building a books data commons for computational analysis is possible, and in turn points to the possibility of creating such a resource for AI training.36

Implications of the Overall Approach

By relying on existing limitations and exceptions in copyright law, the number of books one could include in the corpus of a books data commons is far greater and more diverse. Of course, a bigger dataset doesn’t necessarily mean a higher quality dataset for all uses of AI models; as HathiTrust shows, even a multimillion book corpus can skew in various directions. Still, dataset size generally remains significant to an LLM’s performance – the more text one can train on, or rather the more tokens for training the model, the better, at least along a number of performance metrics.37

While holding the potential for a broader and more diverse dataset, a key limitation in pursuing this approach is that it is only feasible where relevant copyright limitations and exceptions exist. Even then, legal uncertainty means that going down this path is likely to generate, at a minimum, expensive and time-consuming litigation and regulatory engagement. And, at least in the U.S., it could generate billions of dollars in damages if the specific design choices and technical constraints are not adequate to justify a finding of fair use.

This sort of books dataset could be built by expanding use of in-copyright books that have already been digitized by existing libraries and other sources. Specifically, workshop participants mentioned the Internet Archive, HathiTrust, and Google as entities that have digitized books and could repurpose those copies to build a books commons, although challenges with using these datasets were noted. The Internet Archive is in the midst of litigation brought by book publishers over its program for lending digital books; while not directly relevant to the issue of AI training using its corpus of books, this sort of litigation creates a chilling effect on organizations seeking to make new uses of these digitized books. Meanwhile, Google encumbered HathiTrust’s digital copies with certain contractual restrictions, which would need to be addressed to develop a books dataset for AI training, and Google itself is unlikely to share its own copies while they provide it a competitive advantage.

Perhaps as a matter of public policy, these existing copies could be made more freely available. For instance, to ensure robust competition around AI and advance other public interests, policymakers could remove legal obstacles to the sharing of digitized book files for use in AI training. Alternatively, policymakers could go further and affirmatively compel sharing access to these digital book files for AI training.

It's possible that there could be a new mass digitization initiative, turning physical books into new digital scans. At least in theory, one could try to replicate the existing corpora of HathiTrust, for example, without Google’s contractual limitations. At the same time, such an effort would take many years, and it seems unlikely that many libraries would want to go to the trouble to have their collections digitized a second time. Moreover, while new scans may provide some incremental benefit over use of existing ones (e.g., by using the most modern digitization and OCR tools and thus improving accuracy), there is no inherent social value to making every entity that wants to do or allow AI training invest in their own redundant scanning.

A new digitization effort could target works that have not yet been digitized. This may be particularly useful given that previous book digitization efforts, and the Google Books project in particular, have focused heavily (though not exclusively) on libraries in English-speaking countries. Additional digitization efforts might make more sense for books in languages that have not yet been digitized at a meaningful scale. Any new digitization effort might therefore start with a mapping of the extent to which the books corpus in a given language has been digitized.

6. Cross-cutting design questions

The workshops briefly touched on several cross-cutting design questions. While most relevant for approaches that depend on limitations and exceptions, considerations of these questions may be relevant across both tracks.

Would authors, publishers, and other relevant rightsholders and creators have any ability to exclude their works?

One of the greatest sources of controversy in this area is the extent to which rightsholders of copyrighted works, as well as the original creators of such works (e.g., book authors in this context), should be able to prevent use of their works for AI training.

While a system that required affirmative “opt-in” consent would limit utility significantly (as discussed above in the context of directly licensing works), a system that allowed some forms of “opt-out” could still be quite useful to some types of AI development. In the context of use cases like development of LLMs, the performance impact may not be so significant. Since most in-copyright books are not actively managed, the majority of books would remain in the corpus by default. The performance of LLMs can still be improved across various dimensions without including, for example, the most famous writers or those who continue to commercially exploit their works and may choose to exercise an opt-out. Perhaps the potential for licensing relationships (and revenue) may induce some rightsholders to come forward and begin actively managing their works. In such a case, uses that do require a license may once again become more feasible once the rightsholder can be reached.

Workshop participants discussed different types of opt-outs that could be built. For example, opt-outs could be thought of not in blanket terms, but only as applied to certain uses, for example to commercial uses of the corpus, but not research uses. This could build on or mirror the approach that the EU has taken in its text and data mining exceptions to copyright.38 Opt-outs might be more granular, by focusing on allowing or forbidding particular uses or other categories of users, given that rights holders have many different sets of preferences.

Another question is about who can opt-out particular works from the dataset. This could solely be an option for copyright holders, although authors might be allowed to exercise an opt-out for their books even if they don’t hold the copyrights. This might create challenges if the author and rightsholder disagree about whether to opt a particular book out of the corpus. Another related issue is that individual books, such as anthologies, may comprise works created (and rights held) by many different entities. The images in a book may have come from third-party sources, for instance, or a compendium of poetry might involve many different rightsholders and authors. Managing opt-outs for so many different interests within one book may get overly complicated very fast.

In any event, creating an opt-out system will require some way of authenticating whether someone has the relevant authority to make choices about inclusion of a work.
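To make the mechanics concrete, the sketch below filters a candidate corpus against a hypothetical opt-out registry in which each entry records a work identifier and the categories of use that have been opted out of. The registry format, the identifiers, and the category names are all illustrative assumptions rather than any existing standard.

```python
# Hypothetical opt-out registry keyed by a work identifier (e.g., ISBN),
# mapping to the use categories the rightsholder has opted out of.
OPT_OUT_REGISTRY: dict[str, set[str]] = {
    "978-0-000000-00-1": {"commercial", "research"},  # blanket opt-out
    "978-0-000000-00-2": {"commercial"},              # research use still allowed
}

def allowed_for_use(work_id: str, use_category: str) -> bool:
    """Return True unless the work's rightsholder has opted out of this use category.
    Works absent from the registry are included by default, mirroring an opt-out
    (rather than opt-in) design."""
    return use_category not in OPT_OUT_REGISTRY.get(work_id, set())

candidate_corpus = ["978-0-000000-00-1", "978-0-000000-00-2", "978-0-000000-00-3"]
research_corpus = [w for w in candidate_corpus if allowed_for_use(w, "research")]
commercial_corpus = [w for w in candidate_corpus if allowed_for_use(w, "commercial")]

print(research_corpus)    # ['978-0-000000-00-2', '978-0-000000-00-3']
print(commercial_corpus)  # ['978-0-000000-00-3']
```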

Who would get to use the books data commons? For what?

A commons might be made publicly available to all, as has been done with datasets like The Pile. Another possible design choice is to restrict access only to authorized users and to enforce particular responsibilities or obligations in return for authorization. Three particular dimensions of permitted uses and users came up in our discussions:

  • Defining and ensuring acceptable and ethical use: Participants discussed to what extent restrictions should be put on use of the resource. In the case of HathiTrust, acceptable use is implicitly ensured by limiting access to researchers from member institutions; other forms of “gated access” are possible, allowing access only to certain types of users and for certain uses.39 One can imagine more fine-grained mechanisms, based on a review of the purpose for which datasets are used. This imagined resource could become a useful lever to demand responsible development and use of AI; alongside “sticks” like legal penalties, this would be a “carrot” that could incentivize good behavior. At the same time, drawing the lines around, let alone enforcing, “good behavior” would constitute a significant challenge.

  • Charging for use to support sustainability of the training corpus itself: While wanting to ensure broad access to this resource, it is important to consider economic sustainability, including support for continuing to update the resource with new works and appropriate tooling for AI training. Requiring some form of payment to use the resource could support sustainability, perhaps with different requirements for different types of users (e.g., differentiating between non-commercial and commercial users, or high-volume, well-resourced users and others).40

  • Ensuring benefits of AI are broadly shared, including with book authors or publishers: The creation of a training resource might lower barriers to the development of AI tools, and in that way support broadly shared benefits by facilitating greater competition and mitigating concentration of power. On the other hand, concentration in the technology industry is already a significant challenge, AI may look no different, and the benefits of this resource may still simply flow to a few large firms in “winner takes all-or-most” markets. The workshops discussed how, for instance, large commercial users might be expected to contribute to a fund that supports contributors of training data, or more generally to fund writers, to ensure that everyone contributing to the development of AI benefits.

What dataset management practices are necessary?

No matter how a books data commons gets built, it will be important to consider broader aspects of data governance. For example:

  • Dataset documentation and transparency: Transparent documentation is important for any dataset used for AI training. A datasheet is a standardized form of documentation that describes the provenance and composition of the data, as well as its management practices, recommended uses, and collection process (see the sketch after this list).

  • Quality assurance: Above, we note the many features that make books useful for AI training, as compared with web data, for example. That said, the institution managing a books commons dataset may still want to curate the collection to meet the particular purposes of its users. For instance, it may want to take steps to mitigate biases inherent in the dataset by ensuring books are representative of a variety of languages and geographies.

  • Understanding uses: The institution managing a books commons dataset could measure and study how the dataset is used, to inform future improvements. Such monitoring may also enable accountability measures with respect to uses of the dataset. Introducing community norms for disclosing datasets used in AI training and other forms of AI research would facilitate such monitoring.

  • Governance mechanisms: In determining matters like acceptable and ethical use, the fundamental question is “who decides.” While this might be settled simply by whoever sets up and operates the dataset and related infrastructure, participatory mechanisms — such as advisory bodies bringing together a broad range of users and stakeholders of a collection — could also be incorporated.
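As referenced in the first bullet above, a datasheet for a books data commons might be represented as a small machine-readable record. The sketch below is illustrative rather than a fixed schema; the dataset name, field names, and values are hypothetical.

```python
import json

# Illustrative sketch of a machine-readable datasheet for a books data commons;
# all field names and values are hypothetical placeholders.
datasheet = {
    "name": "books-data-commons",
    "version": "0.1",
    "provenance": {
        "sources": ["public domain scans", "permissively licensed e-books"],
        "collection_process": "digitization by partner libraries plus donated digital files",
    },
    "composition": {
        "num_books": 250_000,                  # placeholder figure
        "languages": {"en": 0.6, "de": 0.1, "other": 0.3},
        "date_range": "1500-2024",
    },
    "management": {
        "opt_out_mechanism": "registry keyed by work identifier",
        "access_policy": "gated; research and commercial tiers",
    },
    "recommended_uses": ["LLM pre-training", "non-consumptive text analysis"],
    "known_limitations": ["skew toward older, English-language works"],
}

print(json.dumps(datasheet, indent=2))
```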

7. Conclusion

This paper is a snapshot of an idea that is as underexplored as it is rooted in decades of existing work. The concept of mass digitization of books, including to support text and data mining, of which AI is a subset, is not new. But AI training is newly of the zeitgeist, and its transformative use makes questions about how we digitize, preserve, and make accessible knowledge and cultural heritage salient in a distinct way.

As such, efforts to build a books data commons need not start from scratch; there is much to glean from studying and engaging existing and previous efforts. Those learnings might inform substantive decisions about how to build a books data commons for AI training. For instance, looking at the design decisions of HathiTrust may inform how the technical infrastructure and data management practices for AI training might be designed, as well as how to address challenges to building a comprehensive, diverse, and useful corpus. In addition, learnings might inform the process by which we get to a books data commons — for example, illustrating ways to attend to the interests of those likely to be impacted by the dataset’s development.41

While this paper does not prescribe a particular path forward, we do think finding a path (or paths) to extend access to books for AI training is critical. In the status quo, large swaths of knowledge contained in books are effectively locked up and inaccessible to almost everyone. Google is an exception — it can reap the benefits of its 40-million-book dataset for research, development, and deployment of AI models. Large, well-resourced entities could theoretically try to replicate Google’s digitization efforts, although it would be incredibly expensive, impractical, and largely duplicative for each entity to pursue its own effort. Even then, it isn’t clear how everyone else — independent researchers, entrepreneurs, and smaller entities — would gain access. The controversy around the Books3 dataset discussed at the outset should not, then, be an argument in favor of preserving the status quo. Instead, it should highlight the urgency of building a books data commons to support an AI ecosystem that provides broad benefits beyond the privileged few.
