AnnoTool: Crowdsourcing for Natural Language Corpus Creation

Hayden, K. "AnnoTool: Crowdsourcing for Natural Language Corpus Creation"


This thesis explores the extent to which untrained annotators can create annotated corpora of scientific texts. Currently, the variety and quantity of annotated corpora are limited by the expense of hiring or training annotators. The expense of finding and hiring professionals increases as the task becomes more esoteric or requires a more specialized skill set. Training annotators is an investment in itself and is often difficult to justify: undergraduate students or volunteers may not remain with a project long enough after being trained, and graduate students' time may already be committed to other research goals.

As the demand increases for computer programs capable of interacting with users through natural language, producing annotated datasets with which to train these programs is becoming increasingly important.

This thesis presents an approach combining crowdsourcing with Luis von Ahn's "games with a purpose" paradigm. Crowdsourcing combines contributions from many participants in an online community. Games with a purpose incentivize voluntary contributions by providing an avenue for a task people are already motivated to do, and they collect data in the background. Here the desired data are annotations, and the target community is people who annotate text for professional or personal benefit, such as scientists, researchers, or members of the general public with an interest in science.

An annotation tool was designed in the form of a Google Chrome extension built specifically to work with articles from the open-access online scientific journal Public Library of Science (PLOS) ONE. A study was designed in which participants with no prior annotator training were given a brief introduction to the annotation tool and assigned three articles to annotate. The results of the study demonstrate considerable annotator agreement. This thesis demonstrates that crowdsourcing annotations is feasible even for technically sophisticated texts, and it presents a model of a platform that continuously gathers annotated corpora.
