Survey Equivalence: A Procedure for Measuring Classifier Accuracy Against Human Labels

06/02/2021
by   Paul Resnick, et al.

In many classification tasks, the ground truth is either noisy or subjective. Examples include: Which of two alternative paper titles is better? Is this comment toxic? What is the political leaning of this news article? We refer to such tasks as survey settings because the ground truth is defined through a survey of one or more human raters. In survey settings, conventional measures of classifier accuracy such as precision, recall, and cross-entropy confound the quality of the classifier with the level of agreement among human raters. Thus, they have no meaningful interpretation on their own. We describe a procedure that, given a dataset with predictions from a classifier and K ratings per item, rescales any accuracy measure into one that has an intuitive interpretation. The key insight is to score the classifier not against the best proxy for the ground truth, such as a majority vote of the raters, but against a single human rater at a time. That score can be compared to other predictors' scores, in particular predictors created by combining labels from several other human raters. The survey equivalence of any classifier is the minimum number of raters needed to produce the same expected score as that found for the classifier.
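As a rough illustration of the procedure sketched in the abstract, below is a minimal Python sketch, not the authors' implementation. The agreement-based scoring function, the majority-vote combiner, the Monte Carlo sampling, and all function names (score_against_one_rater, combine_labels, k_rater_score, survey_equivalence) are assumptions chosen for illustration; per the abstract, any accuracy measure scored against a single rater could be substituted.

```python
import random
from statistics import mean

# Each item is a pair (list of K rater labels, classifier prediction).

def score_against_one_rater(prediction, rater_label):
    """Score a prediction against one human rater (simple agreement here)."""
    return 1.0 if prediction == rater_label else 0.0

def combine_labels(labels):
    """Combine several raters' labels into one prediction
    (majority vote, ties broken at random)."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    top = max(counts.values())
    return random.choice([lab for lab, c in counts.items() if c == top])

def classifier_score(items, n_samples=500):
    """Expected score of the classifier against one randomly chosen rater."""
    scores = [score_against_one_rater(pred, random.choice(ratings))
              for _ in range(n_samples)
              for ratings, pred in items]
    return mean(scores)

def k_rater_score(items, k, n_samples=500):
    """Expected score of a predictor built from k raters, judged against a
    held-out rater who was not among the k (the power curve at k)."""
    scores = []
    for _ in range(n_samples):
        for ratings, _ in items:
            held_out = random.randrange(len(ratings))
            reference = ratings[held_out]
            pool = ratings[:held_out] + ratings[held_out + 1:]
            prediction = combine_labels(random.sample(pool, k))
            scores.append(score_against_one_rater(prediction, reference))
    return mean(scores)

def survey_equivalence(items, max_k):
    """Smallest k whose expected score reaches the classifier's expected
    score; interpolating between integer k values could refine this."""
    target = classifier_score(items)
    for k in range(1, max_k + 1):
        if k_rater_score(items, k) >= target:
            return k
    return max_k  # classifier outscores even max_k combined raters

# Toy usage with binary labels and 5 raters per item.
items = [
    ([1, 1, 0, 1, 1], 1),
    ([0, 0, 1, 0, 0], 0),
    ([1, 0, 1, 1, 0], 1),
    ([0, 1, 0, 0, 1], 0),
]
print("survey equivalence ~", survey_equivalence(items, max_k=4))
```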

Related research

10/08/2012
Semisupervised Classifier Evaluation and Recalibration
How many labeled examples are needed to estimate a classifier's performa...

09/24/2018
Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a...

01/02/2023
In Quest of Ground Truth: Learning Confident Models and Estimating Uncertainty in the Presence of Annotator Noise
The performance of the Deep Learning (DL) models depends on the quality ...

10/12/2021
Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations
Majority voting and averaging are common approaches employed to resolve ...

06/13/2018
Explainable Agreement through Simulation for Tasks with Subjective Labels
The field of information retrieval often works with limited and noisy da...

01/14/2022
Multi-Narrative Semantic Overlap Task: Evaluation and Benchmark
In this paper, we introduce an important yet relatively unexplored NLP t...

10/28/2022
An Approach for Noisy, Crowdsourced Datasets Utilizing Ensemble Modeling, 'Human Softmax' Distributions, and Entropic Measures of Uncertainty
Noisy, crowdsourced image datasets prove challenging, even for the best ...
