Are Large Language Models Good Evaluators for Abstractive Summarization?

by   Chenhui Shen, et al.

Human evaluation is often required for abstractive summarization to obtain fairer judgments, but it is time-consuming, costly, inconsistent, and non-reproducible. To overcome these challenges, we explore the potential of using an out-of-the-box LLM (i.e., "gpt-3.5-turbo") for summarization evaluation, without manually selected demonstrations or complex prompt tuning. We compare different evaluation methods, including two methods for Likert-scale scoring and one method for head-to-head comparisons, to investigate the LLM's performance as a zero-shot evaluator. We further propose a meta-correlation metric to measure the stability of the LLM's evaluation capability. Through extensive experiments, we show that certain prompt formats produce better results than others. We also draw attention to the LLM's deteriorating evaluation capability as summary quality rises, and find that its evaluation capability depends on the dimension being evaluated. We discuss the pros and cons of each method, make recommendations, and suggest future directions for improvement.
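A minimal sketch of what zero-shot Likert-scale evaluation with an LLM might look like in practice. The prompt template, the evaluated dimension, and the score-parsing logic below are illustrative assumptions, not the authors' exact setup:

```python
import re

# Assumed prompt template for zero-shot Likert-scale scoring; the paper's
# actual prompt formats (which it shows matter for performance) may differ.
LIKERT_TEMPLATE = (
    "Evaluate the {dimension} of the following summary on a scale of 1 to 5, "
    "where 1 is the worst and 5 is the best.\n\n"
    "Article: {article}\n\n"
    "Summary: {summary}\n\n"
    "Answer with a single number."
)


def build_likert_prompt(article: str, summary: str, dimension: str = "coherence") -> str:
    """Fill the (assumed) Likert-scale prompt template for one article/summary pair."""
    return LIKERT_TEMPLATE.format(dimension=dimension, article=article, summary=summary)


def parse_likert_score(response: str):
    """Extract the first integer in 1..5 from the model's reply, or None if absent."""
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else None
```

The prompt string would be sent to the chat model (e.g., "gpt-3.5-turbo") and the reply passed to `parse_likert_score`; scores across a dataset can then be correlated with human judgments per dimension.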


