BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics

by Liang Ma, et al.

The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., decrease as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error (from an ontology of 7 types) is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find that BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) BUMP enables measuring the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) BUMP enables the measurement of metrics' performance on individual error types and highlights areas of weakness for future work.
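The consistency property described above can be operationalized as the fraction of (faithful, unfaithful) minimal pairs on which a metric scores the faithful summary strictly higher. A minimal sketch, assuming a metric is any callable scoring a (source, summary) pair; the `overlap_metric` and the toy pair below are illustrative stand-ins, not part of BUMP:

```python
def consistency(metric, pairs):
    """Fraction of minimal pairs where the faithful summary scores higher.

    pairs: list of (source, faithful_summary, unfaithful_summary) tuples.
    """
    wins = sum(
        1 for src, good, bad in pairs
        if metric(src, good) > metric(src, bad)
    )
    return wins / len(pairs)


def overlap_metric(source, summary):
    """Toy faithfulness proxy: fraction of summary tokens found in the source.

    A stand-in for a real metric such as an NLI- or QA-based scorer.
    """
    src_tokens = set(source.lower().split())
    sum_tokens = summary.lower().split()
    return sum(t in src_tokens for t in sum_tokens) / max(len(sum_tokens), 1)


# Illustrative minimal pair: the unfaithful summary introduces a single
# entity error ("governor") and a single date error ("Sunday").
pairs = [
    ("The mayor opened the new bridge on Monday.",
     "The mayor opened the bridge on Monday.",
     "The governor opened the bridge on Sunday."),
]
print(consistency(overlap_metric, pairs))  # 1.0 on this toy pair
```

Because the two summaries in each pair differ only by the injected error, a score difference between them isolates the metric's sensitivity to that single error type.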


