A white paper on understanding the implications of face recognition training with CC-licensed photographs

Published on Sep 28, 2022

This white paper presents the case of using openly licensed photographs for AI facial recognition training datasets. The analysis is part of AI_Commons, our activity that explores how AI training datasets, and works included in those datasets, can be better governed and shared as a commons.

The case creates an opportunity to ask fundamental questions about the challenges that open licensing faces today, related to privacy, exploitation of the commons at massive scales of use, or dealing with unexpected and unintended uses of works that are openly licensed.

While the events that form this case go back almost a decade, these issues are still relevant. The lessons that we can draw from this case are applicable today, and can help to govern AI training datasets, and other elements of the AI technological stack. The case also creates an opportunity to review open licensing frameworks and to make them future-proof.

We are currently soliciting feedback on this white paper and are in particular interested in:

  • identifying solutions to the challenges raised by the case;

  • understanding how insights from this case can be translated to ongoing AI dataset governance debates.

Please share feedback directly in the publication or by writing to Alek Tarkowski ([email protected]).

(Research support: Francesco Vogelezang
Cover illustration: Jakub Koźniewski / PanGenerator, CC BY)

The gist of this case

In 2002, twenty years ago, the Creative Commons licenses were created. These legal tools provided standardized means for content sharing through limited, flexible copyrights.

In 2004 Flickr became one of the first social media platforms and the go-to place for publishing photos on the Web. It was one of the early adopters of Creative Commons.

By 2014, there were almost 400 million CC-licensed photos on Flickr. That year researchers from Yahoo Labs, Lawrence Livermore National Laboratory, Snapchat and In-Q-Tel used a quarter of all these photos to create YFCC100M, a dataset of 100 million photographs and videos, many of them depicting people, created for computer vision applications.

To this day, this dataset remains one of the most significant examples of the reuse of openly licensed content. Because of the massive scale and the productive nature of the dataset, it became one of the foundations for computer vision research and the industry built on top of it.

The YFCC100M dataset has set a precedent, followed by many other datasets. Some are designed as samples of the original one, while others copy its approach to content provenance. Many of them became standardized tools used for training facial recognition AI technologies.

In 2019, research by Adam Harvey put the spotlight on MegaFace, a dataset created by a consortium of research institutions and commercial companies as a derivative of the YFCC100M dataset. The dataset includes 3 million CC-licensed photographs and is the most relevant dataset for face recognition research, benchmarking and training. Harvey's research presented the dataset as a privacy-invading tool, consisting of photos of individuals used without their consent. 

MegaFace became exemplary of the tension between the open sharing of photographs of people – with tools like the Creative Commons licenses – and potential harms, mainly related to privacy violations and extractive use of personal data. 

For the open movement – actors who contribute to resources based on non-exclusive forms of intellectual property ownership and advocate for these forms – the MegaFace story illustrated new challenges that open sharing faces in a changed online environment. 

While the case seemed not to involve any use that violated the licensing conditions, it did illustrate the limits of copyright licenses for the use of images that also included personality rights. It forced stewards of open licensing to consider issues beyond the remit of copyright law and the ethical aspects of open licensing. 

The media picked up the story of the use of openly licensed content in datasets that serve facial recognition training. Stories like the MegaFace case became a symbol of potential harm that can be a side effect of open sharing. 

In 2022, the major datasets built with CC-licensed content are still in use. Over the years, these datasets were used to train facial recognition models that were later used in hundreds of projects, including the development of military technologies or surveillance solutions. It is time to find ways to manage both the open resources and the AI solutions built on top of them in a way that is more sustainable and reduces harm. 

Through the AI_Commons project, Open Future Foundation wants to contribute to a collective exploration of solutions to these challenges. We hope that there are lessons learned that can improve the governance of these datasets – as they continue to be used and new datasets are continuously designed and deployed. We aim to initiate a debate on how the datasets, underlying photographs, and their uses can be governed as a commons.


The purpose of this white paper is to present the case of the use of openly licensed photographs for AI facial recognition training datasets. More precisely, the case concerns photographs of people that photographers have published under Creative Commons licenses on the Flickr platform (and to a lesser extent, other media platforms supporting CC licensing, for example Wikimedia Commons or YouTube). Notably, the personality rights of the subjects of the photographs have not been explicitly waived, and the photographs were largely published before the first significant examples of face recognition training.

In the first chapter, we present a short history of the development of facial recognition training datasets, which were created on the basis of CC-licensed photographs. In this paper, whenever “the case” is mentioned, it refers to this process. Over the years, over a dozen different datasets have been created using CC-licensed content. Although they differ in details, they all follow a similar process, with a range of similar characteristics of the datasets and similar challenges that they raise. In this chapter, we also place this history in the context of trends related both to online sharing and the development of so-called “AI technologies.”

In recent years, the case has been referred to as one of the more controversial uses of CC-licensed content and of potential risks or harms related to these uses. It is, after all, a case in which the existence of a pool of photographs of people, made available online for free reuse, contributes to the creation of military and surveillance technologies. 

The case, therefore, creates an opportunity to ask fundamental questions about the challenges that open licensing faces today. Its relevance is related not just to privacy challenges, but also to the fact that the case concerns emergent technologies that potentially exploit the commons for private gain at a great scale. We explore different aspects that make this case so relevant in the second part of this white paper. 

Yet the case is more complex, as it also concerns the perceived risks of emergent technologies. This is a case of unexpected uses of private faces at a massive scale, happening due to technological developments that could not be foreseen when the underlying images were licensed. All these factors, taken together, make this case a unique challenge to the open licensing model.

Training datasets are a fundamental tool enabling the development of face recognition technologies. For this project, we commissioned a study of these datasets from Adam Harvey, an artist and research scientist working on computer vision, privacy, and surveillance. In his essay, Harvey notes that “Suitable training for a face recognition system would require millions, tens of millions, or even hundreds of millions of faces. Getting access to that data is the hidden game-changer in face recognition systems”. And this data was found in the pool of millions of openly licensed photographs and videos, made freely available on platforms like Flickr, Wikimedia Commons or YouTube.

Since the case of the MegaFace dataset received media attention in 2019, it has been used to illustrate an inherent conflict between openness and privacy, and to demonstrate how value can be extracted from the commons for commercial gain. In the background, there are growing concerns about the ethics of artificial intelligence and machine learning technologies, especially in relation to the use of facial recognition technologies for surveillance or military purposes. 

We launched the AI_Commons initiative to contribute to a collective exploration of solutions to these challenges. By studying this case, we hope to understand how openly shared resources can be governed to balance potentially conflicting goals, and in particular open sharing with the protection of privacy and other rights.

We also see this as a case that concerns the irrevocability of CC licenses and unintended or unexpected uses of CC-licensed works. Finding solutions to this case will contribute to making the CC-licensing stack and the Open Access commons future-proof.

Finally, we want to explore what limits to open licensing might have emerged due to changing technological, social or business contexts. We are exploring whether we need a stronger, more managed commons and data governance for some types of data.

Anna Mazgal, in her essay about the case, rightly notes that there is a worry in the open movement that addressing this case can lead to the conclusion that the commons are the problem. We agree with her when she notes that “The resolution of the ‘to care or not to care’ dilemma is not to stop contributing to the commons.” Instead, the solution is to agree that open licensing operates in complex systems, which entail systemic risks. To address them, we need not only to maintain these frameworks but also to review and adapt them. This is one of the goals of this initiative.

In order to do this, we frame the case as a life cycle of data and content flows that involves different actors in the process of creating, sharing and using openly licensed photographs of people aggregated into datasets. From this perspective, we are considering varied factors that structure the case, including not just the law but also social norms and values, and even social imaginaries – that relate, for example, to the faces that are being used as a raw resource for facial recognition systems. In this way, questions related to the responsibility of actors are just as important as those concerning the legality of their actions. 

The diagram below shows this life cycle, including key actors and data or content that flows between them.

Figure 1. The lifecycle of face recognition datasets created with openly licensed photographs of people.

This case is traditionally seen as a challenge to open licensing. Questions about the case have been often asked of the Creative Commons organization, which stewards the licenses. And it is true that any improvements to managing these resources will benefit the open licensing model in general by making it fitter for purpose and future-proof in changing circumstances. 

At the same time, the case is a much broader one. In particular, privacy and personality rights need to be taken into account to understand the case and the challenges that it raises, including the challenges to open licensing itself. For this reason, exploring this case should be a conversation between activists and experts working on different aspects of digital rights. And because of this, in Chapter 3, we explore this case through several different lenses: social norms, copyright, privacy and research ethics.

Some of the potential solutions require the engagement of actors other than the stakeholders of open licensing, such as license stewards or platforms that deploy these licenses. Thus an audit of this case should engage a much wider range of stakeholders – something that we highlight with our life cycle approach. Solutions to the challenges should, in particular, be sought by creators of the face recognition training datasets, and possibly also by their users.

A short note is needed at the end of this introduction on the technical aspects of this case. In general terms, the case is often presented as one concerning “artificial intelligence” (AI). Yet more precisely, only a narrow set of broadly understood AI technologies is relevant to this case: machine vision algorithms capable of detecting faces in images and then recognizing faces by comparing face images. Face recognition technologies rely on Deep Convolutional Neural Network (DCNN) algorithms. Furthermore, the keys to this case are the training datasets used to train, validate and test algorithms – and not the algorithms or models themselves. For this reason, we refrain from using the term “artificial intelligence” and prefer the more precise term “face recognition technologies.” At the same time, we acknowledge that this case is relevant for broader debates about the ethics and governance of AI research and technologies.

In this white paper, we refer to multiple face recognition training datasets. Detailed descriptions of these datasets, including content license composition and notes on their usage, are available in Adam Harvey’s analysis.

The case as a governance challenge

We propose to frame the challenge as one of finding improved ways of governing resources that are key to this case. This case has been ongoing for almost a decade, and any social harms have already occurred – and, therefore, there are few possibilities to mitigate them. At the same time, these datasets will continue to be used, and new ones are being deployed. Hence, there are lessons to be learned from this case that can be applied to the management of datasets for AI training.

By governance, we mean coordinated actions of different actors, using different instruments, methods and strategies that, taken together, create rules and norms. We use the term “governance” to cover these different means of ordering, most importantly including not just legal frameworks, but also social norms and values and even social imaginaries. All these different factors structure this case and determine what is permitted, required or prohibited by different actors.

Open content is largely seen as being by definition ungoverned, beyond the limited governance provided by open licenses – themselves tools that introduce minimal rules, focused on ensuring the greatest, unencumbered sharing and use of resources. In order to find solutions to the challenges raised by this case, we need to look beyond this traditional framing. This needs to involve multiple actors in this space and look at not just the legal and regulatory aspects but also norms and ethics. In other words, we want to shift from thinking of these resources as open to a perspective that treats them as a commons, and that therefore assumes more complex, collective governance of shared resources. With this white paper, we aim to initiate a debate on how the datasets and the underlying photographs and their uses can be governed as a commons.

We approached this research study as one illustrating possible paradoxes of open sharing. As we explored it, it became clear that this is a case not just about open licenses. And even framing it as a tension between sharing and privacy is insufficient. A more complex approach is necessary. Nevertheless, the specificity of this case is related to the fact that the resources used to create all the key face recognition training datasets were CC-licensed. For this reason, in our analysis, we limit ourselves to only relevant issues, taking into account this specific aspect of the case. This makes it relevant not just for AI research ethics but also for the stewardship of open licenses. For this reason, we consider broader debates on AI ethics, on biometric technologies, or on the regulation of machine learning only as context (albeit one that demonstrates the significance of the case).

1. A short history of openly licensed datasets

In 2014, Yahoo released Yahoo! Flickr Creative Commons 100 Million (YFCC100M), an image dataset for a broad range of computer vision uses. The consortium that created the dataset included Yahoo Labs, Lawrence Livermore National Laboratory (LLNL), Berkeley, Snapchat and In-Q-Tel (a national security research institution). It consisted of 99.2 million Creative Commons licensed images and 0.8 million videos taken from Flickr, Yahoo’s subsidiary image-sharing service.

The dataset, which includes at least 11 million images of people,[1] was part of Yahoo’s Webscope Program, which used the company’s resources to create a reference library of datasets for machine learning. According to Yahoo, the goal was to “advance the state of knowledge and understanding in web sciences.” The datasets were, in principle, made available only for academic use, based on a Data Sharing Agreement.

The creators of the dataset wanted to provide a better resource than the one-off datasets used for research in the past. They were also looking to benefit from the vast resources of “photos in the wild.” Facial recognition researchers use the term to describe “natural” images found on the Web, which are more suited for training algorithms than datasets prepared in data labs. 

Previously, researchers relied on studio photos, which had limited use as AI training sets, as they were unrealistic and did not represent real-life situations, leading the AI systems to perform poorly in the real world. “Lenna,” a single image of a woman, has been used since 1973 as the standard for testing image processing software (and should be seen as a case where all the ethical challenges around the use of faces of people for IT research become visible).

Lena Forsén, the portrayed woman, was described by Jennifer Ding as “one of the only women this well referenced, respected, and remembered in [the field of image recognition], … known for a nude photo that was taken of her and is now used without her consent.”

In 2007, the University of Massachusetts, Amherst, released the Labeled Faces in the Wild (LFW) dataset, which, for the first time, was the result of scraping images from the Internet. The dataset demonstrated to machine vision researchers the potential of obtaining data from the Web. It became the most well-known and used facial recognition dataset on the Internet. But its creators had no clear understanding of the legal status of the content. This uncertainty constituted a significant obstacle.

With the YFCC100M dataset, the researchers felt they could build a much more robust dataset, one that was legal to use. “On Flickr, photos, their metadata, their social ecosystem, and the pixels themselves make for a vibrant environment for answering many research questions at scale,” Yahoo researchers wrote when releasing the dataset.

The YFCC100M dataset was a vast improvement in size and legal certainty. The whitepaper for the dataset describes it as a collection that is, on the one hand, “comprehensive and representative of real-world photography” and, on the other, “free and legal to use.” 

The YFCC100M dataset remains, to this day, the largest public multimedia dataset for AI training and other computer vision research. And the fact that the dataset consisted of CC-licensed images and was perceived as freely shared for other researchers (under limited additional terms of the license covering the dataset itself) made it a commonly used, standardized tool.

Today, eight years later, multiple datasets reuse the YFCC100M image pool, including MegaFace, Flickr Diverse Faces, and IBM Diversity in Faces. YFCC100M also defined a solution to the issue of obtaining permission to use “images in the wild”: relying on open licensing. Other datasets follow the same design principles and rely on the pools of open visual content made available under CC licenses on a small number of online platforms: Flickr, Wikimedia Commons and YouTube.

Of these datasets, MegaFace (released in 2015 and improved in 2016) is particularly important as the foundation for facial recognition research and solutions. The dataset, created out of the YFCC100M image pool, includes 3.3 million images that are all CC-licensed and sourced from Flickr. While the YFCC100M dataset is just a metadata file, MegaFace includes the images themselves. As such, it includes everything needed to initiate a face recognition research project.

Some of these datasets have specific purposes or were created – like the Diversity in Faces dataset – to fix problems (in this case with bias) with earlier, existing datasets. Adam Harvey has written detailed case studies of the 13 critical datasets based on openly licensed photographs of people. The datasets follow the general model but differ in the composition of licenses under which the images were originally shared, and in how they frame the terms of use of the datasets.[2]

It is not an exaggeration to say that these datasets played a crucial role in establishing contemporary facial recognition models and, in turn, led to the development of products, services and industries. Developing highly accurate facial recognition technologies without these datasets might have been impossible – or much more complex.

Currently, facial recognition methods and tools are increasingly used in all spheres of life. Researchers used these datasets to advance academic and non-commercial research, but the datasets are also part of commercial services and are even used for law enforcement and military purposes.

Context: the challenges of open datasets

The YFCC100M (and other CC-licensed datasets for AI training) could be seen as prime examples of the value proposition of open sharing and reuse of content. Their availability for reuse, for free, legally and for any purpose enabled the creation of research tools that then proliferated across multiple research and industry fields. This is one side of the story of the open facial recognition training datasets. But over the years, ethical concerns began to emerge.

In 2019, CNET published an article on IBM’s “Diversity in Faces” dataset, a sample of the huge YFCC100M dataset, titled “IBM stirs controversy by using Flickr photos for AI facial recognition.”

In the same year, the New York Times published a story on MegaFace, another training dataset, with photos of nearly 700,000 people taken from Flickr. The story was based on, a research project by Adam Harvey and Jules LaPlace (later rebranded as

The story frames the creation and use of the MegaFace dataset not just in terms of advances in machine vision research; it also focuses on the privacy harms caused by these technologies. It showcases the unexpected uses of openly licensed content that emerged over the years: “Who could have possibly predicted that a snapshot of a toddler in 2005 would contribute, a decade and a half later, to the development of bleeding-edge surveillance technology?”. And it frames the examples of criticism of how the photographs have been used as a sign that users are “waking up” and are increasingly vigilant about their privacy and its violations.

NBC reported on the Diversity in Faces (DiF) dataset, created by IBM by selecting 1 million photos from the YFCC100M dataset. The story started with the lead: “People’s faces are being used without their permission, in order to power technology that could eventually be used to surveil them, legal experts say.” It quoted Brian Brackeen, former CEO of the facial recognition company Kairos, who described the way the dataset is being used as “the money laundering of facial recognition. You are laundering the IP and privacy rights out of the faces.” A piece in the Verge called the dataset and the images of people it gathered “food for algorithms.”

Since then, the cases of facial recognition training datasets that use openly licensed photos have become emblematic of tensions between open sharing of content and unethical or harmful uses of such content, which can adversely affect users’ fundamental rights.

For researchers studying information law, the case became interesting for several reasons. For some of them, focusing on intellectual property law, the case raises interesting questions about how open licenses regulate the use of online content and where the limits of such legal sharing lie. They also study how facial recognition training might fit within the bounds of copyright exceptions and limitations, including those for text and data mining.[3] Others see the case as an opportunity to study tensions between the sharing of content enabled by open licensing and privacy or other fundamental rights.[4]

Context: AI_Commons and the Paradox of Open

The case is an example of what we have called the Paradox of Open: that open is increasingly not just a challenger but also an enabler of concentrations of power. It is a prime example of how opening resources exposes them to imbalances of power and potentially to unintended, even harmful uses. 

This paradox is emblematic of the current digital environment, where, over the last decade, power imbalances and harms have become increasingly visible and are the subject of both public and regulatory debates. There is a clear, shared sense that the Internet of today is different than that of the 2000s – mainly due to the growing influence of commercial platforms and their extractive business models.

These changes, both to the technological context and the social zeitgeist, create challenges for organizations that support and promote open sharing. This concerns organizations stewarding open licenses but also stewards of open infrastructure and advocates for open. 

There is a sense that radical changes to the online environment require a revisiting of the open sharing model and its tools, and a review of their fit for purpose. The main challenges concern, on the one hand, harms and infringements of fundamental rights, and on the other, power dynamics and imbalances related to the use – and benefits of using – open resources.

There have been few debates regarding these issues, and most importantly, the intersection of open licensing and fundamental rights, especially privacy, has not been appropriately investigated. There is a sense that the CC licensing stack has been developed in an “intellectual property silo,” with design decisions focused solely on issues related to copyright and its traditionally perceived impact – for example, creativity, sharing of content, and access to knowledge.

In the debates that started in 2019, the Creative Commons organization has often been singled out as a key actor with the capacity, and responsibility, to address this potential paradox. And Creative Commons has been vocal about the case. In 2019, its CEO at the time, Ryan Merkley, published a statement on “shared images in facial recognition AI.” In it, he acknowledges the issue as one of potential privacy violation and frames it as being beyond the scope of Creative Commons’ stewardship duty and responsibilities: “But copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online. Those issues rightly belong in the public policy space, and good solutions will consider both the law and the community norms of CC licenses and content shared online in general.”

In a subsequent statement, two years later, CC confirms that licenses operate within the copyright system and do not cover either privacy and personality rights or ethical considerations. But it also acknowledges that sharing of content and the open community are affected by these issues, even if they lie beyond what is traditionally perceived as the scope of activities and duties of open advocates: “The legal uncertainty caused by ethical concerns around AI, the lack of transparency of AI algorithms, and the patterns of privatization and enclosure of AI outputs, all together constitute yet another obstacle to better sharing. Indeed, for many creators, these concerns are a reason not to share”.

CC’s position on the case outlines how the case can be framed from the perspective of advocates of open licensing and sharing. Firstly, there is agreement that the challenges largely fall outside the copyright system in which the licenses operate. Thus changes to the licenses themselves are not a solution. In general, there are limits to seeking copyright-based solutions, for example, by fighting potential copyright infringements. Secondly, there is an increasing sense that the case needs to be addressed by looking at issues that open advocates typically consider to fall outside of the scope of their activities. Broadly speaking, there is a need to consider the impact on digital rights on the one hand and issues related to research and technology ethics on the other.[5]

And to solve the challenges related to the use of open datasets for facial recognition training, we need to not only look beyond copyright, and find frames for balancing copyright-related goals of sharing content with those of protecting privacy and other rights. We also need to see the case as a broader system in which data flows among multiple actors, including subjects of the photographs, photographers who take and license and share the photos online, platforms that share content, researchers and entities creating the datasets and sharing them, and finally users of these datasets. All of them potentially might have to take responsibility for finding solutions. And these require addressing issues that fall into varied categories: copyright, privacy and basic rights, research ethics and responsible AI technologies. 

Portrayals of the case often focus on a single actor or a category of actors. For example, an exploration of the case by the journal Nature has been largely limited to the issue of research ethics and the community of researchers creating and using the datasets.

And some media portrayals have solely focused on the role of CC, implicitly suggesting that it has a key responsibility in fixing the case. In fact, CC licenses have been just one factor that contributed to this case’s impact. 

2. Relevance of the case: people’s faces used at scale for commercial gain

Twenty years have passed since the Creative Commons licenses were launched, and almost a decade since the first AI training dataset was created using CC-licensed content. Both CC licensing (together with platforms that support such licensing) and the face recognition training datasets are today robust tools used at a great scale, for developing emergent AI technologies, including face recognition systems. The case is, therefore, fundamental for understanding the challenges that open licensing faces today, under pressure from technological progress. It also enables a better understanding of responsibly governing the lifecycle of AI technologies. 

The case is often described as one where an open content sharing model based on copyright law fails to properly address privacy-related issues, or contributes to developing harmful technologies. Indeed, privacy-related challenges related to creating and using these datasets, identified by journalists and experts in 2019, have still not been solved. The case points to the urgent need to coordinate the governance of openly licensed materials that contain personal data. And it no longer feels satisfactory to say – the way that open advocates sometimes do – that privacy challenges related to open licensing can be brushed aside as a matter not related to copyright law. 

Yet the case is more complex. Its novelty and relevance are related not just to privacy challenges, but also to the fact that the case concerns emergent technologies and the novel risks, real or perceived, associated with them. This is a case of unexpected uses of private faces at a massive scale, happening due to developments in technology that could not be foreseen when the underlying images were licensed. All these factors, taken together, make this case a unique challenge to the open licensing model.

Unexpected uses

As an outcome of this process, CC-licensed photographs were used in ways that the photographers who licensed and published them did not expect. These uses went against the expectations, and even the imagination, of the people who made the content available, and also of the creators of the CC licensing tools. Some of the key questions regarding this case are therefore no longer legal in nature – they concern social norms related to open licensing and the possibility that some form of social contract implicit in this licensing model has been broken by a combination of unexpected and unprecedented scale, uses and technologies. The potential role of such norms, alongside legal aspects of the case, has been acknowledged by Creative Commons in its position on the case from 2019.

In a way, the creators of YFCC100M themselves acknowledge that creating an AI training dataset is a new and unexpected use of photographs. They begin the YFCC100M white paper with a short outline of changes that happened to photography and our understanding of it. Fifty years ago, there was “a world of unprocessed rolls of C-41 sitting in a fridge.” Today, photos can be digitally manipulated, made available online and shared for further use. This new, viral character of images, which potentially can be shared and accessed by millions, is a phenomenon that in the 2010s was already well understood by mainstream users.

Yahoo researchers proposed one more perspective: the pool of photographs shared publicly online should be considered collectively, as a body of knowledge that “goes beyond what is captured in any individual snapshot.” As such, it is a “vibrant environment for finding solutions to many research questions at scale.” And in order to benefit from it fully, the photographs were repackaged into the shape of a dataset consisting solely of their metadata. In this form, they became a standardized resource that met the call for “scale, openness and diversity.” 6

The Creative Commons licenses have been functioning with the motto “permission given in advance” and with a licensing framework that allows for any future uses that meet general licensing conditions. A carte blanche permission also covers novel and innovative uses that do not exist when the work is licensed. The CC FAQ file states that the licenses “have been carefully designed to work with all new technologies where copyright comes into play.”

The case of AI training datasets created from open content tests this assumption in practice. It took little more than a decade since the CC licenses were launched, and a decade of sharing photographs on Flickr, for the first major unanticipated use to occur. One that was not yet imagined when the CC licenses were crafted and when the photographs were shared. And more generally, the developments with AI technologies stretched – and often reached beyond – the social imagination of possible uses of content such as photographs. When the first datasets were being deployed to develop face recognition algorithms, there was no understanding – and even no awareness – of these technologies by the general public. 

Furthermore, harmful uses enabled by CC licensing have always been an issue of concern for the developers and users of the licenses. This is why this case, where there is a connection between CC-licensed photographs and military technology, is so controversial. At some stage of the design of these military technologies, almost certainly, facial recognition training datasets based on open licenses were used. Hence, this case is one where efforts to create a stronger Public Domain through open licensing of content become directly connected with the development of the surveillance-industrial complex.

These novel uses are still covered by the broad and flexible permissions of the CC licenses, seen as legal instruments. But these unexpected, previously unimagined uses can nevertheless break social norms and, therefore, also the social contract on which these licenses are ultimately founded. Permissionless innovation was a core objective of Creative Commons in introducing open licensing models. In hindsight, advocates of open licensing have been paying more attention to potential beneficial uses, hoping that harmful uses would not surface or would remain marginal. And indeed, for about a decade, no major harmful or unintended uses emerged – until the case we are analyzing, in which problematic uses happening at scale became visible. We note that the Open Movement as a whole made few efforts to monitor and mitigate uses that are problematic or harmful.

Today, social norms influence our perception of CC licensing, which is framed by some as less permissive than in theory. An article in the New York Times that covered the case quotes Chloe Papa: “It’s gross and uncomfortable. I wish they would have asked me first if I wanted to be part of it. I think artificial intelligence is cool and I want it to be smarter, but generally you ask people to participate in research.”7 Subjects of photographs that are part of these datasets often focus not on whether licenses have been broken, but on the fact that this kind of use was not imagined or expected by them. 

Uses of faces

Reusing user-generated photographs would not be a contentious issue if these were not photographs of people and their faces. Training datasets with photographs of objects receive little critical attention, just as most issues related to non-personal data (which raises other types of concerns, largely related to geopolitics and digital sovereignty). The potentially exploitative use of people’s faces is what makes this case particularly strong and symbolic. There is specificity to the legal protection of faces. But there is also a broader cultural context: in recent years, there has been an ongoing, turbulent transformation of how faces are treated by people and used in different ways as a resource.

The increasing use of facial recognition technologies has led to a growing awareness that faces are becoming a raw resource for algorithmically-driven surveillance services. Yet, for years, the canonical facial recognition issue requiring public scrutiny was that of a camera equipped with facial recognition capacity, tracking people in public spaces. 

And recently, the issue became even more prominent because of Clearview AI. The company claims to have built a database of over 3 billion facial images by scraping the Web and describes itself as the “World’s Largest Facial Network.” In November 2021, the UK Information Commissioner’s Office (ICO) found this practice incompatible with UK national data protection laws and fined the company 17 million GBP. On similar legal grounds, the French privacy watchdog – the CNIL – ordered Clearview AI to immediately cease the collection and use of French residents’ data and to delete it.

Growing awareness of the issues related to face recognition training datasets has shown that the challenges also lie “upstream”, and concern how algorithms – which are at the heart of surveillance technologies – are built. Faces are specific as a type of resource but at the same time ubiquitous in current popular culture. Emblematically, the selfie – either a photograph or video – is today a dominant cultural form, and one on which business models of many social networks are built. 

New technological capacities to manipulate faces have been developed and made mainstream in recent years. Remixing faces with the use of “face-swapping” apps has become a mundane, commonplace experience. At the same time, there is increased awareness and public debate about deep fakes – synthetic videos or images of faces created with the use of AI technologies – and the risks they pose, mainly as a new vector of disinformation that can destabilize the public debate. Finally, synthetic images of actors are increasingly used in movies.  

Faces are, therefore, increasingly treated not just by lawmakers and data protection agencies but also by the general population as more than just personal visages. They are seen as resources that can be used and abused.8 In Clearview AI's case, the scraped faces database fuels a surveillance service as a reference database. And in the case that we are studying, faces are used to train machine vision algorithms – which are then deployed in projects like Clearview AI. 

Faces were always valuable, and some legal systems around the world recognize that by protecting them not just on the basis of privacy laws, but in terms of an asset akin to intellectual property. But in practical terms – even in those jurisdictions where appropriate laws exist – the rules applied to a narrow category of people: politicians, artists, celebrities, business people. Today, almost everyone faces the challenge of balancing between seeing the face as their most valuable personal data, an image most closely connected with their personhood, and seeing it as a resource. 

Online influencers, like all celebrities, use their face as a resource professionally. But many other people, in everyday life, use their face as a resource on which they build their personal communication. This explains the strong reactions of many people who discover their faces within the facial recognition datasets, and learn that they stopped being “theirs” – and are instead a raw resource for AI technologies.

This can be seen as an analogous process to the one that Creative Commons diagnosed at the turn of the century: the democratization of content production, which necessitated changes to copyright laws. Today, a similar shift might be happening with regard to faces and a broad range of personality and privacy rights. And, even in those jurisdictions where such laws don’t exist, this remains an ethical matter – whether every person should have some say regarding how their likeness is being used. 

Debates on facial recognition, and the use of faces as a raw resource, are gaining intensity. As a sign of the times, facial recognition features introduced in 2010 were removed from Facebook in 2021. In justifying the decision, the company cited “many concerns about the place of facial recognition technology in society.” In Europe, lawmakers are debating banning biometric and facial recognition technologies, which in many cases have been developed using the datasets that are the topic of this study. 

In the YFCC100M database, at least 10% of images are those of people. Photographs of faces, therefore, constitute a significant part of all openly shared photographs. As we will see, the challenge raised by this case is partially due to the fact that open licensing frameworks never considered them as a specific resource, requiring a distinct approach.

Uses at a massive scale

AI technologies are inherently “Big Data” technologies that often require large datasets. Machine learning technologies create mass-scale use cases that are hard for most users to grasp, as they involve the analysis of millions of images. The massive scale of use is another factor that is not typically imagined by users, as they share their content under an open license. 

Use at scale also means the use of aggregated resources, which again, even when allowed by the open licenses, raises questions on the validity or ethics of use. At a massive scale, some elements of the open licensing framework, or normative assumptions that underpin such frameworks, might fail. For example, attribution is formally possible – by virtue of including it in the metadata of the dataset – but in practical terms, it is meaningless, when attribution needs to be given to millions of authors.

Finally, use at scale is characteristic of commercial users. Quantitative researchers are the only other category of users who have the capacity to use content at scale. This is once again a case of how social norms and expectations shape the public debate around open licensing. What is nominally allowed by the license may nevertheless become an object of public scrutiny or even outrage. In the past, commercial use of CC-licensed content has been one of the more contentious issues for open advocates.9

Online platforms like Flickr are a key category of users that are able to use openly licensed content at scale. And in their case, the issue is further complicated by the fact that platforms are perceived as having a stewardship role, when facilitating the sharing of content. 

This was the case with another project that Yahoo developed based on Flickr photographs, Flickr Wall Art, which allowed users to order prints of CC-licensed photographs from Yahoo.10 This was another use of CC-licensed photos at a massive scale, as potentially 50 million different images were made available, according to Yahoo. Yahoo used only content that was licensed to allow commercial use, and framed this as a new service offered to photographers – the announcement called this “a new opportunity to share your beautiful photos with the world.”

Nevertheless, the company was criticized for attempting to profit from the works of photographers shared on Flickr, without sharing the profit. Since Yahoo was using the works based on CC licenses, it was not sharing any revenue with the photographers. Despite the fact that Yahoo’s use of the content was within the bounds set by the licenses, the company was forced to close the program after a month.11

Both the Flickr Wall Art project and the beginning of the case that is the subject of this study predate by a few years the start of a public debate on platforms and the “value gap,” which also concerns fair remuneration for creators and with which the Wall Art case shares some similarities. The Wall Art case pitted the formal rules offered by the CC licensing stack against issues concerning fair remuneration of creators – which traditionally were seen as lying beyond the scope of what CC licenses regulate.

The scale of use in the case of AI training datasets becomes even more significant when we compare the size of the datasets with the size of the overall pool of CC-licensed photographs. In 2015, Creative Commons estimated that approximately 391 million images and photos were shared under CC licenses, out of which Flickr hosted 351 million.12 The YFCC100M dataset, therefore, used over a quarter of all images that were CC-licensed at the time. 

This number should be put in the context of a debate on the use of CC-licensed content, which has been an ongoing one since the very beginning of Creative Commons. The use and reuse of publicly shared resources have been seen as both a key goal and a potential challenge – with data showing that many resources are not being used, or even discovered. 

In this context, the fact that a quarter of all photographs was used to establish a resource that in turn generated many further technological projects, could be seen as a major success story for the open movement. But as is clear from this case study, instead the massive scale became one of the contentious factors for this case.

3. Seeing the case through different lenses

In order to better understand the case of AI facial recognition training with the use of openly licensed photographs, it should be conceptualized as a “life cycle” of content and data flows that begins with the taking of a photograph of a person and results in the use of an AI training dataset to build models that are at the heart of various AI technologies. This lifecycle connects together an ecosystem of different actors who produce, share and use photographs, and then produce, share and use datasets consisting of aggregated photographs. The key actors in this lifecycle include:

  • Subjects of photographs, whose faces are ultimately the raw resource for facial recognition AI training. 

  • Photographers who took photographs of the subjects and published them online under an open license. In some cases, they are also the subjects in the photographs.

  • Online platforms that enable the sharing of photographs under open licenses. 

  • Creative Commons organization, as the steward of the licensing stack used to openly share photographs.

  • Researchers and institutions who are the creators and stewards of the datasets.

  • Users of the datasets, including both research and commercial or industrial actors. 

At each stage of this lifecycle, the use of photographs and associated metadata can be governed. At the same time, different challenges and risks appear at each stage, which might necessitate governance. And “upstream” governance decisions affect “downstream” uses. In particular, the CC licenses that photographers use to share their photographs structure the whole downstream lifecycle. But they should not be seen as immutable – as they are used, the licensing conditions become interpreted and tested in practice.

It should be noted that this life cycle does not include face recognition technologies, including algorithms and models developed using the datasets. These technologies fall beyond the scope of this case. At the same time, the current public debate on the regulation of face recognition centers on this “downstream” stage, which is only loosely connected to the lifecycle we are exploring. In a recent paper, Abeba Birhane, Vinay Uday Prabhu and Emmanuel Kahembwe argue that advances in models are rapid and embraced by the research community, while advances in responsible design of datasets are ignored or slow – pointing to the failure to curate key datasets properly.13 

This lifecycle approach allows us to understand, in turn, the perspective of different stakeholders in this case. And also, different aspects of the case become visible at different stages. We address them by looking at this case in turn through the lens of social norms, copyright, privacy and research ethics.  

Social norms lens

Although the case of AI training with CC-licensed photographs of people gained the attention of academics and the media several years ago, there have been few actionable insights into the perceptions of users – people who publish photos on Flickr and other photo-sharing services or those who are the subjects of these photos. 

A rare example of such insights is “Discriminator,” an interactive documentary by Brett Gaylor about how photographs from his honeymoon became part of the MegaFace dataset. However, it focuses on the personal experience of Gaylor, who is known for his activist style of filmmaking.

And in 2019, the Ada Lovelace Institute published the results of a UK-based public opinion survey on facial recognition technologies. The survey shows high levels of distrust, with 55% of respondents wanting restrictions to be imposed on the use of facial recognition by the police, and 46% wanting to opt out of the use of these technologies.14 As noted above, the survey topic connects with the issues we are studying but falls beyond the life cycle that we describe.

Otherwise, the views of the majority of people whose works and faces form the AI training datasets are largely unknown. And these views should be taken into account when designing governance solutions for this case. We, therefore, begin with the social lens to shed some light on popular perceptions and attitudes toward open licensing and AI training.

Together with Selkie Research, we conducted a survey study to fill this gap and provide an initial understanding of these perceptions. The online-based survey was advertised on Flickr and in the Wikimedia community, aiming to directly reach the photographers that publish photos in these key services, from which dataset photographs are obtained. Our sample consists of 142 people publishing photographs online, of whom 90% publish photos on Flickr, and over 70% use Creative Commons licenses.  

While the small sample size limits our ability to draw conclusions, it still offers an exploration of views that have not been researched before. And some of the results paint a clear and stark picture of the users’ views.

When asked about the motives for sharing photos, four of them stood out: positive contribution to the community (72%), enabling others to freely use content (71%), helping others by sharing content (50%) and documenting and sharing cultural heritage (49%). Mapping these motives on the Basic Human Values model shows that photographers who publish using CC licenses are highly socially oriented and prioritize community welfare and cooperation as values, over protection-related values like self-enhancement, conservation or security. 

Diagram with the results from the survey question "Which of the following reasons motivate you to share photos under a Creative Commons license? Please select the 3 most important ones."

Figure 2. Which of the following reasons motivate you to share photos under a Creative Commons license? Please select the 3 most important ones.

The survey also shed some light on consent-related practices (an issue that is crucial when studying the case with a privacy lens). Almost half of the respondents do not obtain consent from the people they photograph, and a third obtain verbal consent. Written consent is obtained by less than 5% of photographers. A related question concerned the terms of service of the Flickr platform, which were read by 60% of the respondents. Of those, 70% declared that they had a specific interest in copyright and licensing rules, while only 25% paid attention to data protection provisions. These results suggest an overall low awareness of data protection rules. At the same time, over 70% of users declare that they have in the past decided not to share a photo that they took, mainly citing the protection of their subjects’ privacy or dignity.

Diagram with results of survey question: "When taking a photograph of a person, how do you usually obtain consent to take the photo of their face?"

Figure 3. When taking a photograph of a person, how do you usually obtain consent to take the photo of their face?

When asked about the existence of risks and dangers of sharing photos with faces under CC licenses, only 35% of respondents answered in the affirmative – the majority do not perceive such risks. Of those, fewer than 20% mentioned, unprompted, the development of AI technologies when asked about specific risks – a similar number mentioned issues related to copyright infringement. These results suggest a relatively low sense of risk, in particular of risks related to AI development.

With the survey, we also aimed to understand attitudes toward different shared content uses. We asked respondents to describe as acceptable or not cases that included two differentiating factors: commercial usage and AI training. The highest level of reluctance was for commercial use of the photo, whether for AI-related use or not.

Diagram showing the responses to the survey question: " Please imagine that you shared on Flickr a photo of your friend's face. You made the photo available under a Creative Commons license that allows the photograph to be freely shared. How acceptable would the following situations be for you?"

Figure 4. Please imagine that you shared on Flickr a photo of your friend's face. You made the photo available under a Creative Commons license that allows the photograph to be freely shared. How acceptable would the following situations be for you?

Respondents also saw as problematic those cases where facial recognition solutions were used by governments, especially by authoritarian states. They also had strongly different opinions about using photographs for AI training, depending on the purpose. Healthcare, academic research and educational uses were declared acceptable by at least 2/3 of the respondents, while security and surveillance, business and military uses were declared unacceptable by over 80% of respondents.

Diagram for the responses to the survey question: "Let's assume that you can set specific sharing permissions for the use of photos with faces. Please indicate whether you would allow using your photographs to train Artificial Intelligence systems for the following purposes."

Figure 5. Let's assume that you can set specific sharing permissions for the use of photos with faces. Please indicate whether you would allow using your photographs to train Artificial Intelligence systems for the following purposes. 

We also inquired about emotions triggered by the MegaFace case, which has been the topic of Adam Harvey's research. Asked to describe their emotions, respondents declared high levels of fear, sadness, disgust and anger - and very low levels of trust, joy and anticipation. 

Diagram showing responses to survey question: "The description of the hypothetical situation below is based on a real case - a training dataset called MegaFace. What are your feelings upon learning about this? Please indicate this on the scale below, where 1 is the lowest level of a particular emotion, and 5 is the highest level. "0" means that you do not feel this emotion."

Figure 6. The description of the hypothetical situation below is based on a real case - a training dataset called MegaFace. What are your feelings upon learning about this? Please indicate this on the scale below, where 1 is the lowest level of a particular emotion, and 5 is the highest level. "0" means that you do not feel this emotion.

Finally, we investigated the issue of responsibility for the uses of publicly shared photographs with faces, in particular for negative outcomes. It was a question with the highest level of "N/A" responses. This hints at a high level of uncertainty on this issue, particularly regarding subjects of the photos, image hosting platforms and Creative Commons. The results hint at a perceived low level of responsibility of the photographers and photo subjects, and the highest responsibility of the persons or organizations using the photographs.15

Diagram showing responses to survey question: "In your opinion, who should take responsibility for the use of photos with faces, which are shared on image hosting platforms? (in particular for negative outcomes of such use). Please rate on a scale where: 1 is the lowest level of responsibility and 5 is the highest."

Figure 7. In your opinion, who should take responsibility for the use of photos with faces, which are shared on image hosting platforms? (in particular for negative outcomes of such use). Please rate on a scale where: 1 is the lowest level of responsibility and 5 is the highest.

Our survey results suggest that users of image hosting platforms may not perceive AI risk in relation to their photo-sharing activities and generally do not pay a lot of attention to risk management. They also make some privacy-related choices but otherwise do not have a strong practice of obtaining consent or caring about basic rights. At the same time, they have strong, negative reactions when presented with specific scenarios of use (see Figure 4) or even concrete examples (see Figure 6). They also have strong and clear preferences regarding acceptable and non-acceptable uses. However, commercial use of the content they share (and in particular of photographed faces) seems to them more problematic than the fact that the use is related to AI technologies.

Copyright lens

The copyright lens is the obvious lens to be applied to this case by people with an open licensing background, and it is the one that has been mainly deployed by Creative Commons when commenting on the case.

From a copyright law perspective, the main question is whether the photographs' open licensing makes the datasets “free and legal to use,” as claimed by dataset creators. As mentioned before, the datasets vary in their composition, with some of them – including MegaFace, MS COCO, and other YFCC100M derivatives – consisting solely of CC-licensed content. In their case, explicit permission has been given in advance. Yet as the datasets consist of content made available under different CC licenses, licensing conditions might limit the scope of the legal use of the overall dataset. Other datasets combine CC-licensed content with photographs covered by traditional copyright, for which permission to use has not been granted in advance.

Regardless of the licensing conditions, the uses might also fall under exceptions and limitations – which vary in different parts of the world. The text and data mining exceptions are worth mentioning here, such as those introduced in Europe in the Copyright Directive.16

Adam Harvey argues that since the YFCC100M was created, there has been a misrepresentation of CC licensing conditions by face recognition dataset creators and maintainers (as well as full ignorance of privacy-related aspects, which we consider in the next section). In the research paper announcing the dataset, the creators of YFCC100M make the claim that it is “free and legal to use.”17 Harvey writes that “This misrepresents Creative Commons and set a false precedent for other researchers that reverberated throughout academic and industry research communities, though perhaps that was the intention.”18 This framing has not been questioned by any of the creators of the later datasets or the AI research community in general. And the legal aspects of the datasets received little scrutiny until 2019, when the case gained public visibility. Even since then, we still lack a definitive analysis of the case from a copyright law perspective.

YFCC100M, the canonical openly licensed dataset, does indeed include only images made available under CC licenses (or more precisely, it is a text file that includes URLs of the images and metadata). Yet over 50% of the photographs are licensed under conditions that include the Non-Commercial attribute and thus prohibit commercial use – which makes the unqualified claim that the dataset is free and legal to use invalid.

Diagram showing YFCC100M Image License Distribution.

Figure 8. YFCC100M Image License Distribution. Source: Adam Harvey, based on metadata provided by Flickr API in 2020.

Harvey analyzed the use of MegaFace, another AI training dataset that includes around 2.5 million works licensed with a Non-Commercial attribute. He has documented over 20 major datasets used by commercial organizations, including ByteDance, Google, Intel and IBM. 

Diagram showing MegaFace Image License Distribution.

Figure 9. MegaFace Image License Distribution. Source: Adam Harvey, based on metadata provided by Flickr API in 2020.

There is still uncertainty whether these uses, in fact, infringe the terms of the CC licenses – either the prohibition of commercial uses or other elements of the licenses. Andres Guadamuz, a legal scholar with long experience in studying open licensing, argues in his analysis of the copyright aspects of this case that the legal situation will depend on the jurisdiction, as some have adopted data mining exceptions that could cover such uses (albeit these are largely limited to non-commercial data mining). Guadamuz also argues that it is the Non-Commercial licensing condition attached to the majority of the dataset content (which is made available for free) and not the characteristics of the training done with the dataset that should be taken into account. Finally, Guadamuz notes that it would be hard for an individual photographer who would like to sue for copyright infringement to prove that their particular work was used.19

Even if such analysis holds true, it does not change the fact that dataset creators seem not to investigate the licensing conditions of the datasets and of the photographs they include. A 2022 dataset licensing study concludes that license compliance faces significant challenges. First, licensing information is often missing or hard to locate. Second, verifying the validity of licenses is difficult, mainly because multiple sources, with varying licensing models, are often combined in datasets. Third, the impact of content licenses on the license for the aggregated dataset seems not to be analyzed. Finally, licensing information is ambiguous and does not clearly define rights and obligations. The authors conclude in particular that the studied datasets (including ImageNet, MS COCO or FFHQ)20 “might not be suitable to build commercial AI software due to a high risk of potential license violations.”21

Issues with license compliance by creators and users of the datasets might create the opportunity for legal action that would limit the use of the datasets or even lead to their decommissioning. It would also send a clear signal to researchers that these datasets are not compliant with the CC licenses and, thus, not legal to use under the “permission given in advance” model.

Harvey also notes that while many datasets make an effort to properly include attribution information in machine-readable formats, some do not.22 MegaFace is a dataset whose metadata does not include any of the attribution information available on Flickr. The MS COCO dataset is another example, as it includes solely direct links to the JPG image files on Flickr, without accompanying metadata. The attribution requirement is flexible in CC licensing and the FAQ on CC licensing states that the method of giving credit “will depend on the medium and means you are using, and may be implemented in any reasonable manner.” Nevertheless, there is a chance that the attribution requirements are not met in the case of some of the datasets due to the way data is presented for datasets counting millions of items. Harvey also notes that the size of metadata files raises challenges with the meaningfulness of provided attribution.23 For example, the YFCC100M includes attribution information in a metadata file that is 12.5 GB in size. Harvey argues that this way of presenting attribution data can only be understood by people with advanced technical and research skills. He also points to challenges with the attribution of datasets made available through an API, as is the case of the FairFace dataset. In this case, data can be accessed directly and attribution information can be circumvented, according to Harvey.24

Nevertheless, a solely copyright-based perspective is insufficient to understand this case. Ultimately, even if the use of photographs, and of the dataset in which they are included, is allowed (either under the CC licensing rules or under exceptions and limitations provided by the law), issues remain that are related to privacy, use of personal data, or research ethics.

Challenges with license compliance and potential infringement create a practical opportunity to limit the use of datasets created from openly licensed photographs. But considering just these issues is not enough, as the identified problems would persist even in cases where there is full compliance with open licenses under which the content has been published.

While actors looking at the case from a copyright, or open licensing, perspective acknowledge these challenges, they vary in their opinion on the relevance of the case. Creative Commons shared an initial opinion in a blog post from 2019. It acknowledged the challenges related to AI development, privacy and research ethics. At the same time, it defined them as challenges that copyright is not a good tool to solve: “CC licenses were designed to address a specific constraint, which they do very well: unlocking restrictive copyright.” The position was further elaborated in 2021. Again, a clear line was drawn between copyright-related matters that are within the scope of CC’s stewardship of open licensing and challenges that fall beyond copyright. But the position goes on to acknowledge that “The legal uncertainty caused by ethical concerns around AI, the lack of transparency of AI algorithms, and the patterns of privatization and enclosure of AI outputs, all together constitute yet another obstacle to better sharing.”

This position has been expressed by John Weitzmann, General Counsel at Wikimedia Deutschland, in his opinion for the blog. He argues that these are challenges related largely to privacy and personality rights law and should be treated as such “in ways that do not touch the core rights granted in open licensing.” But Weitzmann goes even further and makes a claim that’s both strong and interesting: that the Open Access Commons, created by sharing content under open licenses, has no stewards: “Open Content belongs to all and thus to nobody in particular. It was given not to “us,” but to the commons – piece by piece by initial individual rightsholders.” He thus believes that there are no governance structures that can be introduced ex-post to account for the impact of previously unexpected uses of the commons: “A retroactive change to the notion of Open would be equal to a departure from the existing model into a whole new digital commons through a gigantic fork of code, a schism of Openness.”25

In July 2022, a new license, the BigScience RAIL license, was launched by BigScience, a machine learning research workshop convened by HuggingFace. In parallel, HuggingFace released the BLOOM large language model, using the license. The RAIL license adds use restrictions to a licensing framework that would otherwise be described as an open license. The list of restricted uses includes, among others, uses that: violate laws, harm or exploit minors, disseminate false information that harms others, disseminate personally identifiable information that harms others, disseminate machine-generated content without identifying it as such, harass others, impersonate others, allow automated decision making that adversely impacts the rights of individuals, discriminate based on social behavior or personality characteristics, exploit vulnerabilities of a specific group of persons, discriminate based on legally protected characteristics or categories, provide medical advice, or generate information for the administration of justice, law enforcement, immigration or asylum processes.

The license was designed and is being used to release machine learning models – which fall beyond the scope of the life cycle that we are studying. Nevertheless, it is a relevant development that aims to extend the open licensing approach. And the use of a restrictive license like RAIL to share datasets was proposed by Abeba Birhane, Vinay Uday Prabhu and Emmanuel Kahembwe.26

It is worth noting that the license was designed mainly to address issues of responsible AI technologies and as a tool that enforces use restrictions for shared AI models. “By using such licenses, AI developers provide a signal to the AI community, as well as governmental bodies, that they are taking responsibility for their technologies and are encouraging responsible use by downstream users,” write the creators of the license.27 The open licensing aspect is taken as a given, and the researchers refer to openness as a norm in computer science research. And they seem not to be concerned with the implications that their license raises for the use of open licenses. 

The idea of allowing individuals to prohibit the use of licensed work for biometric surveillance has been proposed by Albert Fox Cahn from the Surveillance Technology Oversight Project, in a session organized by Creative Commons at the 2021 CC Summit. The session concerned the need to balance open sharing and privacy interests. 

The privacy right lens

The creation and use of facial recognition training datasets interfere with the privacy of people whose pictures are included in these datasets. The right to privacy is typically considered a negative right, defending the individual's desire to be left alone.28 By interference with the right, we mean that an action encroaches upon, or “punctures,” the sphere of someone’s private life. This puncture exists regardless of whether it would be considered legal in a given jurisdiction from a formal point of view.

The creation of training datasets by scraping content that is already publicly available online seems to be based on the assumption that people who publish content online accept (or at least should accept) this interference. In other words, they have no expectation of privacy. According to this approach, “publicly available” would equal “free for use”.29

In the case of datasets containing pictures posted with the current CC license (version 4.0), this assumption is seemingly justified by the fact that the licensor is expected to waive or not assert any of his or her privacy and other similar personality rights.30 Whereas the right to privacy is typically thought of as a negative right, as noted above, other personality rights also involve a person's interest in representing himself or herself in a public environment and developing their identity and personality. Although the licensor is expected to waive those rights, the current licenses also specifically say that they do not cover such rights. 

It is important to note that a previous version of the license (2.0)31, the one that had been used on the Flickr platform and under which the vast majority of the images used in the datasets were licensed, did not mention personality rights. The addition of the phrase in the latest iteration of the licenses emphasizes that their intended function is as tools to enable the sharing of intellectual property rather than for the management of privacy rights.

As far as the privacy rights of other people depicted in the pictures are concerned, the situation becomes even more complicated. The licensor is advised to ensure that, in case the material includes rights held by others, he or she gets permission to sublicense those rights under the CC license. This clause is included as a form of guidance in the “Considerations for licensors and licensees.”32 The fact that privacy and other similar personality rights are not covered by the license, as noted above, means that the sublicensing might only apply to intellectual property rights. The Considerations are silent about the “privacy and/or other similar personality rights” held by others.

As a result, the issue of other parties' privacy rights is never brought up. The creators of the datasets seem to assume that the photographer has some form of agreement with the people photographed about the use of the images. There is, however, no proof that consent for processing is always given. The survey by Selkie shows that, in many cases, the opposite is true – more than half of respondents did not obtain consent from the people they photographed.

Given this, the reliance by dataset creators and users on the CC license to justify the interference with people’s privacy and further use of the photos of their faces becomes merely window dressing.

Furthermore, the reliance on the CC license to deal with privacy issues raised by datasets used to train face recognition tools also proves problematic in light of existing data protection regulation provisions, particularly the European GDPR.33

Pictures of faces used to identify a natural person constitute a special category of personal data. The GDPR prohibits processing this data category unless specific conditions listed in art. 9 GDPR apply. Among these conditions, there is the “explicit consent” of the data subject to the processing or a situation where the data was made “manifestly public” by the data subject. If we stick to the facts, none of these conditions are met in the discussed case.34 The option that remains to justify the processing without consent seems to be, again, the claim that it is necessary for scientific research purposes. This potential justification for processing is included in art. 9 GDPR, accompanied, however, by numerous additional conditions and requirements. In addition, the GDPR recognizes several personal data protection rights. These rights include, for example, the right of the data subject to object to the processing of data concerning them. The entity that processes the data might refuse to honor the request if it demonstrates compelling legitimate grounds for the processing that override the interests of the person objecting.

From the point of view of data protection rights, as expressed most clearly in the GDPR provisions, the case seems to boil down to whether and under what conditions, in the given circumstances, we accept interference with the rights of people depicted in the photos and the processing of their sensitive personal data without their consent. If put this way, this becomes a balancing exercise of different interests: privacy and the right to control one’s data on the one hand and the interests of research and innovation on the other. It is also a matter of the legal grant made by the use of a CC license and raises the question of how far one can rely on the CC license.

So far, we’ve discussed the harm that the gathering and processing of personal data might do to the rights of the person whose image is used. These individual harms can be thought of as having two dimensions. The first relates to what Viljoen calls “sludgy consent” when “corrupted architecture or design process may result in an appearance of consent that in fact violates or undermines true consent.” In the current context, the concept of “sludgy consent” can be applied particularly to the false assumption that the person uploading pictures of people obtained their consent to do that. The second form of individual harm is the harm of access, i.e., when people – in this case, third parties depicted in the pictures – cannot exercise their rights and limit access to information about themselves.35

Although these harms might not be addressed, they can – as shown in the previous sections – at least be conceptualized within the rights frameworks that are in force in Europe. While the traditional accounts of privacy and data protection recognize the interests of data subjects from whom data is collected, they do not account for the effects of using one person’s data to infer information about other people. Arguably, however, these other people have a stake in the terms governing how data is collected and processed.36 Additionally, the current approaches to privacy and data protection ignore the uneven distribution of effects that facial recognition technology has on different populations based on ethnicity, exacerbating the negative social repercussions of being a member of a minority group. These two problems are still unresolved within the privacy and data protection framework in force today.

The research ethics lens

Just as copyright was not designed to address privacy issues,37 it was not intended to solve ethical issues related to research. The case we discuss is a story about the challenges of research ethics related to using publicly available personal data in research, the risk of misusing research results, and the role research and researchers play in developing controversial and potentially harmful technologies. 

The researchers who published the YFCC100M dataset considered it a bold move that would foster better science.38 David Ayman Shamma, director of research at Yahoo at the time the YFCC100M dataset was created, described the goals in simple terms: “We wanted to empower the research community by giving them a robust database.”39 They either did not think about the ethical issues the creation and use of such a dataset entailed or assumed that the CC license addressed them all. This, however, was a serious omission. Copyright was not designed to address ethical issues related to research, such as consent to partake in research, the right to withdraw, or the broader impact of research on society and the risk of harm it might cause.

Voluntary consent to participate in research and the right to withdraw at any time have been the gold standard and guiding ethical principles regarding research with humans. Their importance is related to the history of research misconduct that in Europe reached its peak during the Second World War when Nazi doctors conducted inhumane experiments on prisoners in concentration camps without their consent.40

These basic principles have been challenged in the age of big data research. In the case of research that (re)uses existing datasets containing massive amounts of personal data collected from different sources, seeking consent for each study might be impossible (as the individuals might not be contactable) or at least impractical. As far as the right to withdraw from research is concerned, this becomes meaningless if a person is not aware that their information is included in the dataset, which, as has been shown, is often the case in the discussed datasets.

In the case of research that uses previously collected personal data or biological material, it has been argued for some time already (for example, in the context of genetic research) that seeking the individual consent of research subjects might not always be possible or even necessary. Given that, alternatives to traditional consent have been proposed. One such alternative is the concept of broad consent, where the data subject gives consent for their data to be used in certain areas of scientific research at the moment when the data is collected.

When posting personal data online, people rarely consider that someone might use it in research. While research purposes are sometimes mentioned in the terms and conditions of the Internet platform, few people read these. Such consent, while broad, could hardly be called voluntary or informed. In the case of photos, the assumption that data subjects have provided some form of consent for their data to be used in research is further undermined when these images are not published by the person depicted in them but by someone else, which is the typical case of friends posting each other’s photos. While one could argue that people who make their data public relinquish their expectation of privacy, the same can hardly be said of third parties caught in a photo.

Using a CC license seems to take care of this issue through general permission for data to be reused, which would also imply research purposes. However, this interpretation is difficult to reconcile with the principle of voluntary, informed consent of the research participant – in this case, the person depicted in the photo – to participate in the research. Even broad consent assumes that participants know what research areas their data will be used for. This is not the case when datasets contain pictures of people who are unaware of their inclusion.

Another approach would be to say that research with personal data is radically different from research with humans and that the harm of using data without the subjects' approval is minimal if the information is already public. Therefore, researchers are entitled to conduct such research without the subject's consent. According to this line of argument, benefits arising from the research outweigh the potential harm that, in any case, is minimal. Ethical research is then less about strictly abiding by the principle of voluntary consent and more about balancing interests: privacy on the one hand and the value of research on the other.

Even under the strict European personal data protection framework, processing personal data without consent may be permitted if it serves a "legitimate" or "public" interest. This approach is, however, predicated on the assumption that scientific research benefits society. This essential, although sometimes not explicitly articulated, premise is undercut in the case of research developing controversial technologies that are considered harmful to society. The fact that the photos were shared in the first place without the explicit consent of the people depicted in them raises another red flag. Finally, it is worth recalling that the Selkie research revealed that respondents, by and large, did not consider commercial AI training a legitimate use of their photos.

It is important to emphasize that voluntary consent has never been the sole safeguard of ethical research with humans. Another essential requirement is for researchers to obtain the approval of a research ethics committee or an institutional review board before collecting data. The role of such committees is vital when there are challenges in obtaining consent from research participants. However, unlike in the field of biomedical ethics, where there are common and well-established standards on how to carry out research and treat the participants, ethics committees or boards might not be equipped to provide sufficient guidance in the areas related to big data processing and the development of facial recognition technologies. Moreover, ethics committees tend to focus on the potential harm to individuals involved in research rather than a project’s impact and its potential to cause damage to society. In some cases, the collection of photographs of faces from the Internet for research purposes does not alarm ethics committees, which assume that the use of public data is not controversial.41 On top of that, research ethics committees exist primarily at academic institutions, which means that research carried out by companies escapes their scrutiny. Taking all of this into consideration, it becomes evident that there are numerous blind spots in the ethical oversight of big data research.

The use of results of research into facial recognition technologies often raises serious ethical and human rights issues, which raises the question of the responsibility of researchers for what happens to their work when it leaves the lab.

Linked to that is the question of sampling bias,42 which can lead to poor generalization of the algorithms. This, in turn, can make them unreliable and ripe for potential fundamental rights abuses. If the dataset used to train a face recognition system consists of more photos of light-skinned than dark-skinned faces, the system will not achieve the same accuracy for the latter. Black and other minority ethnic persons and women might be disproportionately misidentified by algorithmically driven identification technology. High disparities in error rates were found between lighter-skinned men and darker-skinned women. For example, research showed that in the case of three commercial gender classifiers, darker-skinned women were up to 44 times more likely to be misclassified than lighter-skinned men. New York Times journalist Steve Lohr aptly summarized this problem when he wrote that facial recognition technology “is accurate if you’re a white guy.”
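Disparities of this kind are typically measured by computing error rates separately for each demographic group and then comparing them. The sketch below is a minimal illustration under stated assumptions: the record format, the group labels and the function names are hypothetical, and the sample records are made up for the example rather than drawn from any of the studies cited here.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Compute the misclassification rate per group.

    `records` is an iterable of (group, true_label, predicted_label) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        if truth != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Ratio between the highest and the lowest group error rate."""
    lowest = min(rates.values())
    highest = max(rates.values())
    return float("inf") if lowest == 0 else highest / lowest


# Made-up evaluation records, for illustration only.
records = [
    ("group_a", "f", "f"), ("group_a", "m", "m"),
    ("group_a", "f", "f"), ("group_a", "m", "f"),
    ("group_b", "f", "m"), ("group_b", "m", "m"),
    ("group_b", "f", "m"), ("group_b", "f", "f"),
]
rates = error_rates_by_group(records)
print(rates)
print(disparity_ratio(rates))
```

Reporting a single aggregate accuracy figure would hide exactly the gap that a per-group breakdown like this one makes visible.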

Another critical issue arising at this point is the question of associations between universities and tech companies that produce technologies used for mass surveillance. These ties blur the line between research and use. Recently, there has been a more significant push toward raising awareness among scientists about the potential consequences of their work. Still, according to a survey conducted by Nature among 480 researchers who have published papers on facial recognition, researchers remain split on some key ethical issues. These include whether there is a need to obtain informed consent from individuals before using their faces in a facial-recognition data set or whether research that uses facial-recognition software should require prior approval from ethics bodies.43

Ultimately, even though “Flickr” and other similar cases raise many concerns related to consent, the right to withdraw from research, the protection of personal data, and misuse of research results, they seem to fall between the cracks of the existing ethical oversight system.

AI_Commons: exploring the governance of the case

The case that is the subject of this white paper points to challenges to both open licensing and to the governance of face recognition datasets. New, broadly understood governance mechanisms are needed to account for the risks, mitigate harms, and preserve the commons that are the pool of openly licensed content. 

Most importantly, the search for solutions cannot be limited to a single approach or perspective, tied to just one of the lenses that we used in this white paper. The novelty of this case comes from the need to balance, through the governance of shared resources, the value generated by the open sharing of resources and their reuse on the one hand, and the protection of both individual and collective rights on the other. 

More generally, the case will not be solved just by judging the legality of the uses. For the same reason, the solutions cannot be found solely in laws that exist in specific jurisdictions (such as TDM exceptions or specific privacy protection laws). Key governance questions concern not the legal status of uses of content, but how responsibility is taken for these uses. 

Responsibility in open ecosystems is distributed, and in many cases weak or non-existent, as these resources freely flow between actors and across different spheres of life and milieus. In the case that we analyzed, photographs flow from the sphere of everyday life, through research milieus, into professional and corporate sites where they are used to build new AI technologies, including those for surveillance and military purposes. “We may trust the entity that gathers or captures that information, and we may have consented to an initial capture, but custody over that dataset leaks,” said Adam Greenfield – a technology writer and urbanist – in a comment for the Financial Times.

A better governance model will assign responsibility to various actors, with varied capacities to structure different phases of the face recognition dataset life cycle. It will also anticipate and prevent the type of harm that we described in this report from occurring in the future.

To launch a conversation on the life cycle governance of the face recognition datasets created from openly licensed photographs, we are ending this white paper with a set of questions that the case raises at each life cycle stage.

Subjects of photographs. Was consent given by the photo subject, and for what uses? Did the photo subject consent to the photo being uploaded and shared under a CC license? Did the photo subject agree to the photo being used in another context? Do we need consent to share photographs openly?

Photographers. Has the photographer considered what uses will be made of the photograph? Was the photographer aware that the photographs could be used to create surveillance technologies? Is the photographer taking into account personality rights when sharing the photographs? Is consent of photo subjects required when uploading the photographs? Is the specificity of photographs of people's faces considered when making licensing decisions?

Online platforms. How do the terms of service (ToS) of the platform structure the use of photographs for face recognition training? Could ToS be used to limit these uses? How can a platform support users in making informed decisions about sharing photographs of people?

CC license steward. How are personality rights addressed (in particular third-party rights)? Does the CC license take into account that the content might be used to create surveillance technologies or other potentially harmful uses? Should CC licenses be adjusted to take into account the responsible use of AI technologies?

Creators and stewards of the datasets. How was the dataset curated? Are the dataset creators considering what uses will be made of the dataset? Are the creators aware that the dataset can be used to create surveillance technologies? Were the photographs and their authors properly attributed? Is the use of the photographs in line with the CC license conditions? Does the creation of the dataset break any laws? Is consent from photo subjects required at this point? Do the dataset creators make sure that the dataset is representative and diverse? Is the dataset openly shared? Are there additional terms of service/licensing terms defined?

Users of the datasets. Is the use of the dataset and the photographs allowed under existing law, for example, TDM exceptions? Do the dataset users check if it is representative and diverse? Are there any attempts to govern the use of the datasets among their users?
