A Survey of Parameters Associated with the Quality of Benchmarks in NLP

by Swaroop Mishra, et al.

Several benchmarks have been built with heavy investment in resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task. Despite this finding, benchmarking, while trying to tackle bias, still relies on workarounds that discard low-quality data, and so do not fully utilize the resources invested in benchmark creation, and that cover only limited sets of biases. A potential solution to these issues, a metric quantifying quality, remains underexplored. Inspired by successful quality indices in several domains such as power, food, and water, we take the first step towards such a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark. We look for bias-related parameters that can potentially help pave the way towards the metric. We survey existing works and identify parameters capturing various properties of bias, their origins, types, and impact on performance, generalization, and robustness. Our analysis spans datasets and a hierarchy of tasks ranging from NLI to Summarization, ensuring that our parameters are generic and not overfitted to a specific task or dataset. We also develop certain parameters in this process.




