In the struggle over who can train AI models and how, there’s a casualty many people don’t realize: the open web.
The web’s seemingly infinite content — a product of decades of thought and millions of minds — also provides a crucial training corpus for chatbots and other AI models. But in an effort to protect their work from unauthorized use, many content creators, from big news publications to smaller blogs, are closing off access by disallowing web crawlers on their sites. This limits access not only for big AI companies, but also for researchers working in the public interest, archival projects that trace the history of the internet, and independent developers who deserve equitable access.
This is the thorny challenge at the core of “Consent in Crisis,” a new research paper authored by the Data Provenance Initiative (DPI), a volunteer research collective spanning more than 50 contributors and 15 countries.
---
Mozilla spoke with study leads Shayne Longpre, Robert Mahari, Ariel Lee, and Campbell Lund about the new research, what it means for the open web, and what’s next for DPI.