Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce

03/08/2023
by Serge Gladkoff, et al.

In natural language processing (NLP), we rely on human judgement as the gold standard for quality evaluation. However, there has been an ongoing debate on how best to evaluate inter-rater reliability (IRR) levels for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In this work, we first introduce a study on how to estimate the confidence interval for the measured value when only one data point (evaluation) is available. This leads to our example with two human-generated observational scores, for which we introduce the “Student's t-Distribution” method and explain how to use it to measure the IRR score from only these two data points, as well as the confidence intervals (CIs) of the quality evaluation. We give a quantitative analysis of how the evaluation confidence can be greatly improved by introducing more observations, even if only one extra observation is added. We encourage researchers to report their IRR scores by all available means, e.g. using the Student's t-Distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy. This t-Distribution method can also be used outside of NLP to measure the IRR level for trustworthy evaluation of experimental investigations whenever the observational data is scarce.

Keywords: Inter-Rater Reliability (IRR); Scarce Observations; Confidence Intervals (CIs); Natural Language Processing (NLP); Translation Quality Evaluation (TQE); Student's t-Distribution
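
As an illustration of the idea summarized above, the following is a minimal sketch of the textbook t-based confidence interval for the mean of a very small sample. It is an assumption that this matches the construction used in the full paper; the abstract does not give the formula. The function name `t_confidence_interval` and the rater scores are hypothetical, and the critical values are the standard two-sided 95% table entries for Student's t.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t from standard tables,
# keyed by degrees of freedom (n - 1). Only very small samples are covered.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def t_confidence_interval(scores):
    """Return (lower, upper) of the 95% t-interval for the mean of `scores`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two observations are needed")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only tabulates 2 to 6 observations")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = T_CRIT_95[df] * sem
    return mean - half_width, mean + half_width


# Made-up quality scores from two raters, then with a third rater added.
two_raters = [80.0, 90.0]
three_raters = [80.0, 90.0, 85.0]

for label, scores in (("two raters", two_raters), ("three raters", three_raters)):
    lower, upper = t_confidence_interval(scores)
    print(f"{label}: mean={statistics.mean(scores):.1f}, "
          f"95% CI=({lower:.2f}, {upper:.2f})")
```

With only two observations there is a single degree of freedom, so the critical value is very large and the interval is wide; adding a third observation lowers the critical value sharply. The interval is not clipped to the valid score range, which a real evaluation would likely want to handle.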
