DQI: A Guide to Benchmark Evaluation

by Swaroop Mishra, et al.

A "state-of-the-art" model A surpasses humans on benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, as opposed to relying on existing binary and black-box approaches. We propose a novel approach to this underexplored task of quantifying benchmark quality by introducing a data quality metric: DQI.
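To make the idea of "spurious bias" concrete, here is a minimal, illustrative sketch (not the paper's actual DQI formula) of one kind of data-quality signal: the fraction of vocabulary words that co-occur with only a single label. Such words act as shortcuts a model could exploit without learning the underlying task. The function name `spurious_cue_score` and the toy samples are assumptions made for this example.

```python
from collections import defaultdict

def spurious_cue_score(samples):
    """Toy data-quality signal (illustrative only, not the DQI formula):
    the fraction of vocabulary words that appear with exactly one label,
    i.e. words a model could exploit as spurious shortcuts.

    `samples` is a list of (text, label) pairs.
    Higher values suggest more exploitable lexical cues in the benchmark.
    """
    word_labels = defaultdict(set)
    for text, label in samples:
        for word in text.lower().split():
            word_labels[word].add(label)
    if not word_labels:
        return 0.0
    single_label = sum(1 for labels in word_labels.values() if len(labels) == 1)
    return single_label / len(word_labels)

# Tiny hypothetical benchmark: "the" occurs with both labels,
# while the other four words each occur with only one label.
samples = [
    ("the cat sat", "pos"),
    ("the dog ran", "neg"),
]
print(spurious_cue_score(samples))  # 4/5 = 0.8
```

A real metric like DQI combines several such components (vocabulary, inter-sample similarity, and so on); this sketch only shows why a scalar quality score over a benchmark's data is computable at all.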




A Survey of Parameters Associated with the Quality of Benchmarks in NLP

Towards GAN Benchmarks Which Require Generalization

Black-Box Optimization Revisited: Improving Algorithm Selection Wizards through Massive Benchmarking

Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow

AI Evaluation: past, present and future

Learning to Solve Complex Tasks by Talking to Agents

LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
