Finding a Balanced Degree of Automation for Summary Evaluation

by Shiyue Zhang, et al.

Human evaluation for summarization tasks is reliable but suffers from reproducibility issues and high costs. Automatic metrics are cheap and reproducible but sometimes correlate poorly with human judgment. In this work, we propose flexible semi-automatic to automatic summary evaluation metrics, following the Pyramid human evaluation method. Semi-automatic Lite2Pyramid retains the reusable human-labeled Summary Content Units (SCUs) for the reference(s) but replaces the manual work of judging SCUs' presence in system summaries with a natural language inference (NLI) model. Fully automatic Lite3Pyramid further substitutes SCUs with Semantic Triplet Units (STUs) extracted automatically via a semantic role labeling (SRL) model. Finally, we propose in-between metrics, Lite2.xPyramid, where a simple regressor predicts how well each STU can simulate its SCU and we retain the SCUs that are harder to simulate, providing a smooth transition and balance between automation and manual evaluation. We compare against 15 existing metrics by evaluating human-metric correlations on 3 existing meta-evaluation datasets and our newly collected PyrXSum (with 100/10 XSum examples/systems). The results show that Lite2Pyramid consistently has the best summary-level correlations; Lite3Pyramid performs better than or comparably to other automatic metrics; and Lite2.xPyramid trades small correlation drops for a larger reduction in manual effort, which can lower the cost of future data collection. Our code and data are publicly available at:
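To make the Lite2Pyramid idea concrete, below is a minimal sketch (not the authors' implementation) of how an off-the-shelf NLI model could judge whether each human-written SCU is entailed by a system summary, with the entailment decisions averaged into a Pyramid-style content score. The model name, label ordering, and helper functions are illustrative assumptions.

```python
# Hedged sketch: NLI-based presence judgment of SCUs in a system summary.
# Assumes the Hugging Face "roberta-large-mnli" checkpoint, whose label
# order is (0=contradiction, 1=neutral, 2=entailment).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "roberta-large-mnli"  # illustrative choice of NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def scu_present(summary: str, scu: str) -> bool:
    """Return True if the NLI model predicts the summary entails the SCU."""
    inputs = tokenizer(summary, scu, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 2  # 2 = entailment for this model

def lite2pyramid_like_score(summary: str, scus: list[str]) -> float:
    """Fraction of reference SCUs judged present in the system summary."""
    if not scus:
        return 0.0
    return sum(scu_present(summary, s) for s in scus) / len(scus)

# Toy usage example with made-up SCUs:
scus = ["A storm hit the coast.", "Thousands of homes lost power."]
print(lite2pyramid_like_score("A coastal storm knocked out power to thousands.", scus))
```

Lite3Pyramid would follow the same scoring loop but replace the human-labeled SCUs with STUs extracted by an SRL model, and Lite2.xPyramid would mix the two unit types based on a regressor's predicted simulatability.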

