A large-scale audit of 1800+ AI text datasets, tracing their provenance and composition from original sources through dataset creation.
The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners and for the models they train. To remedy these shortcomings in data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets, documenting their licenses, license conditions, original sources, creators, collection times, languages, tasks, topics, metrics, domains, and usage. Our landscape analysis highlights sharp divides in the composition and focus of commercially open vs. closed datasets, with closed datasets monopolizing important categories: lower-resource languages, more creative tasks, richer topic variety, and newer and more synthetic training data. This points to a deepening divide in the types of data made available under different license conditions, and it heightens the implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 68%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit together with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for over 45 of the most popular finetuning data collections, or 1800+ supervised datasets, that we analyze in this work: https://www.dataprovenance.org/.