In the struggle over who can train AI models and how, there’s a casualty many people don’t realize: the open web.
The web’s seemingly infinite content — a product of decades of thought and millions of minds — also provides a crucial training corpus for chatbots and other AI models. But in an effort to protect their work from unauthorized use, many content creators, from big news publications to smaller blogs, are closing off access by disallowing web crawlers on their sites. This limits access not only for big AI companies, but also for researchers working in the public interest, archival projects that trace the history of the internet, and independent developers who deserve equitable access.
This is the thorny challenge at the core of “Consent in Crisis,” a new research paper authored by the Data Provenance Initiative (DPI), a volunteer research collective spanning more than 50 contributors and 15 countries.
---
Mozilla spoke with study leads Shayne Longpre, Robert Mahari, Ariel Lee, and Campbell Lund about the new research, what it means for the open web, and what’s next for DPI.