Recent breakthroughs in language modeling are powered by large collections of natural language datasets. This has triggered an arms race to train models on disparate collections of data that are incorrectly, ambiguously, or insufficiently documented, leaving practitioners unsure of the ethical and legal risks.
To address this, the Data Provenance Initiative has created a mapping of 2,000+ popular text-to-text finetuning datasets, tracing each from origin to creation and cataloging its data sources, licenses, creators, and other metadata, which researchers can explore using this tool. The purpose of this work is to improve the transparency, documentation, and informed use of datasets in AI.