Researchers from MIT, Cohere for AI, and 11 other institutions launch free Data Provenance Explorer platform to track and filter audited datasets for ethical, legal, and transparency considerations
By Shayne Longpre and Sara Hooker
The rapid adoption and scaling of AI technology trained on diverse, often poorly documented, datasets has led to a growing data transparency crisis. Inconsistent tracking of the origins of a piece of data, which we refer to as data provenance, has introduced a range of legal and ethical challenges. Further complicating matters, the rules governing data usage remain poorly defined, resulting in a web of confusion for developers, scholars, and policymakers.
To address this challenge, a multidisciplinary team of machine learning (ML) and legal experts from MIT, Cohere For AI, and 11 other institutions has joined forces to audit and trace nearly 2,000 of the most widely used fine-tuning datasets. Collectively, these datasets have been downloaded tens of millions of times and are the backbone of many published NLP breakthroughs.
The result of this multidisciplinary initiative is the single largest audit to date of AI datasets. For the first time, these datasets carry tags linking them to their original data sources, their various re-licensings, their creators, and other data properties.
To make this information practical and accessible, the group has launched an interactive platform, the Data Provenance Explorer. The platform allows developers to track and filter thousands of datasets for legal and ethical considerations, and enables scholars and journalists to explore the composition and data lineage of popular AI datasets.
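To illustrate the kind of filtering the Explorer enables, here is a minimal sketch in Python. The record fields below (`license_use`, `requires_attribution`, `languages`) are illustrative assumptions for this example, not the Explorer's actual schema or API.

```python
# Hypothetical sketch of filtering audited dataset metadata by license
# category. Field names are illustrative, not the Explorer's real schema.
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    name: str
    license_use: str              # e.g. "commercial", "non-commercial", "unspecified"
    requires_attribution: bool
    languages: list = field(default_factory=list)


def filter_commercial(records):
    """Keep only datasets whose audited license permits commercial use."""
    return [r for r in records if r.license_use == "commercial"]


records = [
    DatasetRecord("dataset-a", "commercial", True, ["en"]),
    DatasetRecord("dataset-b", "non-commercial", True, ["en", "fr"]),
    DatasetRecord("dataset-c", "unspecified", False, ["en"]),
]

usable = filter_commercial(records)
print([r.name for r in usable])  # → ['dataset-a']
```

In practice a developer would apply such a filter over the audited metadata before assembling a fine-tuning mixture, excluding anything whose provenance or license is unresolved.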
The details of this collective effort can be found in a new paper, The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.
Addressing the Data Transparency Crisis
Poor data provenance has broad and potentially long-term implications for AI progress in both industry and research. Our extensive audit of open source datasets finds that several factors contribute to the current data transparency crisis. Crowdsourced aggregators like GitHub and Papers with Code, along with many of the open source LLMs trained on data from these aggregators, have an extremely high proportion of missing data licenses (“Unspecified”), ranging from 72 to 83 percent. By comparison, only 30 percent of licenses are missing under our own annotation protocol, which categorizes dataset licenses based on legal guidance.
In addition, the licenses assigned by crowdsourced aggregators frequently allow broader use than the dataset authors originally intended: aggregators listed licenses too permissively in up to 29 percent of the cases audited. Incorrect or distorted licensing leads to cascading misattribution as datasets are re-packaged into larger collections under the wrong licenses.
Overall, the ecosystem analysis found systemic problems across data provenance practices, including the use of sparse, ambiguous, or incorrect license documentation. Simply put, even if a practitioner wants to do the right thing and responsibly attribute, they are ill-equipped to navigate opaque and often mislabelled datasets, leaving themselves open to a variety of risks.
Understanding Systematic Differences in Commercially-Available Data
Auditing the most widely adopted datasets allows us to understand how licenses influence access to data, and who creates the data that shapes the technology around us. The audit identified several key trends that impact access and safety of models trained on these widely leveraged datasets.
There is a sharp and widening divide between data licensed as commercially open versus closed. A growing share of publicly released datasets are not licensed for commercial use. While this does not impact research efforts, it does constrain smaller, early-stage companies that do not otherwise have access to large datasets.
Data with wider task and topic diversity, as well as data useful for longer-form generation, like long articles or books, is typically restricted to non-commercial use. Rising restrictions on the highest-quality, most broadly useful data can widen the gap between the quality of data available for commercial versus non-commercial use.
There is also a geographic skew in available datasets. For all of the datasets we analyze, we trace the geographic coverage according to language. We find a stark Western-centric skew in representation across datasets. Asian, African, and South American nations are sparsely covered, if at all. The resulting models are likely to have inherent bias, underperforming in critical ways for users of models outside of the West.
Our research on the data transparency crisis reveals an additional set of challenges around access and safety. Practitioners face a limited pool of commercially usable datasets, and those that are available skew toward Western languages.
Open Legal Ambiguities
While the Data Provenance Explorer enables better transparency when licenses are in tension, major legal ambiguities remain in data licensing that the authors acknowledge cannot be resolved by tooling alone.
Geographic divergence in legal frameworks. Different jurisdictions have different, and evolving, laws. It can be challenging for practitioners to determine which laws apply to a given machine learning project when the relevant rules vary between the locations where the data is collected and downloaded, where the model is trained, and where the model is deployed. This inconsistency creates practical challenges and may ultimately slow or hinder the development of the industry.
Licenses used for datasets are often ill-suited. Most open-source licenses were designed for software but are increasingly being applied, unmodified, to datasets. There are also issues with the bundling of datasets, where individual datasets, each potentially governed by a different license, are amalgamated into collections. If the requirements of the underlying license agreements are irreconcilable, such as conflicting copyleft requirements, it becomes extremely hard for developers to use those collections while respecting all license terms.
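The bundling problem can be sketched concretely. Under a common conservative reading, a collection is only as permissive as its most restrictive member, and two distinct copyleft licenses in one bundle may be irreconcilable because each demands that derivatives carry its own terms. The three-tier categories and the specific license names below are illustrative assumptions for this sketch, not the paper's taxonomy.

```python
# Hedged sketch: effective use category of a bundled dataset collection.
# Categories and license names are illustrative assumptions.
RESTRICTIVENESS = {"commercial": 0, "unspecified": 1, "non-commercial": 2}


def effective_category(categories):
    """A bundle is only as permissive as its most restrictive member."""
    return max(categories, key=lambda c: RESTRICTIVENESS[c])


def copyleft_conflict(licenses):
    """Flag bundles mixing two or more distinct copyleft licenses, each of
    which requires derivatives to carry its own (mutually exclusive) terms."""
    copyleft = {lic for lic in licenses if lic in {"GPL-3.0", "CC-BY-SA-4.0"}}
    return len(copyleft) > 1


print(effective_category(["commercial", "non-commercial", "unspecified"]))
# → non-commercial
print(copyleft_conflict(["GPL-3.0", "CC-BY-SA-4.0", "MIT"]))
# → True
```

A tool can surface these conflicts, but as the authors note, it cannot resolve them: whether two copyleft terms are truly reconcilable is a legal question, not a computational one.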
These often conflicting legal frameworks and reliance on licenses that were not created with data in mind make responsible stewardship of data by practitioners difficult to navigate.
The lack of data transparency and inconsistency in licensing pose a significant challenge for model developers, researchers, and everyday practitioners. In aggregate, these practices are creating an ethical, legal, and transparency crisis.
The open sourcing of the Data Provenance Explorer, and the accompanying repository for practitioners to download the data filtered for license conditions, mark an important step forward for data transparency and reliable provenance. This large-scale audit is the launch of a wider multi-institutional initiative, where users all over the world can contribute to the explorer, setting a path forward to improve transparency in data licensing and responsible use.
About the Authors
Shayne Longpre is a PhD candidate at MIT with a focus on data-centric AI, its governance and impact.
Sara Hooker leads Cohere For AI, a research lab that seeks to solve complex machine learning problems.
The Data Provenance paper and Explorer tool were the result of a cross-institutional and cross-disciplinary collaboration involving experts from 13 institutions. The full list of authors and the paper can be found here.