Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

08/21/2019
by Mor Geva, et al.

Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years. A common crowdsourcing practice is to recruit a small number of high-quality workers and have them generate a large volume of examples. Having only a few workers produce the majority of examples raises concerns about data diversity, especially when workers freely generate sentences. In this paper, we perform a series of experiments showing that these concerns are evident in three recent NLP datasets. We show that model performance improves when training with annotator identifiers as features, and that models are able to recognize the most productive annotators. Moreover, we show that models often do not generalize well to examples from annotators who did not contribute to the training set. Our findings suggest that annotator bias should be monitored during dataset creation, and that test set annotators should be disjoint from training set annotators.

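To make the two practices mentioned in the abstract concrete, below is a minimal sketch (not the authors' implementation) of (a) an annotator-disjoint train/test split and (b) exposing the annotator identifier as an input feature. The DataFrame and its "text", "label", and "annotator_id" columns are hypothetical; the split uses scikit-learn's GroupShuffleSplit.

```python
# Hypothetical sketch: annotator-disjoint splitting and annotator-ID features.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy crowdsourced dataset; column names are illustrative assumptions.
df = pd.DataFrame({
    "text": ["example one", "example two", "example three", "example four"],
    "label": [0, 1, 0, 1],
    "annotator_id": ["worker_a", "worker_a", "worker_b", "worker_c"],
})

# Group-aware split: no annotator contributes examples to both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["annotator_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
assert set(train_df["annotator_id"]).isdisjoint(test_df["annotator_id"])

# One simple way to "train with annotator identifiers as features":
# prepend the worker ID to the text before it reaches the model.
train_texts_with_ids = train_df["annotator_id"] + " [SEP] " + train_df["text"]
```

If a model's accuracy drops on the annotator-disjoint test set relative to a random split, that gap is one symptom of the annotator bias the paper investigates.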

