Large Language Models are Diverse Role-Players for Summarization Evaluation

by Ning Wu, et al.

Text summarization has a wide range of applications. Evaluating the quality of generated text is a complex problem, and a major challenge is the clear divergence between existing metrics and human evaluation. For example, human annotators can assess the quality of a document summary along both objective dimensions, such as grammatical and semantic correctness, and subjective dimensions, such as comprehensiveness, succinctness, and interestingness. Most automatic evaluation methods, such as BLEU and ROUGE, may not capture these dimensions well. In this paper, we propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects. First, we model the objective and subjective dimensions of generated text with a roleplayer prompting mechanism. Furthermore, we introduce a context-based prompting mechanism that generates dynamic roleplayer profiles based on the input context. Finally, we design a multi-roleplayer prompting technique based on batch prompting that integrates multiple evaluation results into a final evaluation result. Experimental results on two real datasets for summarization show that our model is highly competitive and highly consistent with human annotators.
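
The multi-roleplayer idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the roleplayer profiles, the prompt wording, the function names `build_batch_prompt` and `aggregate_votes`, and the choice to score a summary by the fraction of roleplayers that prefer it over the reference are all assumptions made for illustration. No LLM is called; the votes are supplied by hand.

```python
# Hypothetical static profiles; the paper generates profiles dynamically from context.
ROLEPLAYERS = [
    ("grammar checker", "judges grammatical and semantic correctness"),
    ("busy reader", "prefers succinct summaries"),
    ("domain expert", "cares about comprehensiveness"),
]


def build_batch_prompt(reference, generated, roleplayers):
    """Pack one question per roleplayer into a single prompt (batch prompting)."""
    lines = [
        f"Reference summary: {reference}",
        f"Generated summary: {generated}",
        "",
    ]
    for i, (name, description) in enumerate(roleplayers, start=1):
        lines.append(
            f"Roleplayer {i} ({name}, who {description}): "
            "which summary is better, 'generated' or 'reference'?"
        )
    return "\n".join(lines)


def aggregate_votes(votes):
    """Assumed aggregation: fraction of roleplayers preferring the generated summary."""
    if not votes:
        raise ValueError("at least one vote is required")
    return votes.count("generated") / len(votes)


prompt = build_batch_prompt("The cat sat.", "A cat was sitting.", ROLEPLAYERS)
# In a real pipeline the votes would be parsed from the LLM's answer to `prompt`.
score = aggregate_votes(["generated", "reference", "generated"])
```

Batching all roleplayer questions into one prompt means a single model call yields every vote, which is the property the abstract attributes to batch prompting; other aggregation rules, such as a majority vote, would fit the same structure.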




