Evaluating and Improving Factuality in Multimodal Abstractive Summarization

by David Wan, et al.

Current metrics for evaluating the factuality of abstractive document summarization achieve high correlations with human judgment, but they do not account for the vision modality and are therefore inadequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore that leverages the robustness and strong factuality detection performance of these two metrics on image-summary and document-summary pairs, respectively. Next, because no meta-evaluation benchmarks exist for assessing the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: selecting important images to focus on during training, and serving as a reward for reinforcement learning to improve the factuality of multimodal summary generation with respect to both automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ
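
The "simple weighted combination" described above can be illustrated with a minimal sketch. It assumes the two component scores have already been computed elsewhere and that they are mixed with a single weight in [0, 1]; the function name `clip_bert_score` and the parameter `alpha` (including its default) are hypothetical names chosen for this illustration and are not taken from the released code.

```python
def clip_bert_score(clip_score: float, bert_score: float, alpha: float = 0.5) -> float:
    """Combine an image-summary score and a document-summary score.

    Assumption: the combination is convex, i.e.
        alpha * clip_score + (1 - alpha) * bert_score
    with alpha in [0, 1]. The abstract only says "weighted combination",
    so both this form and the default alpha=0.5 are illustrative.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip_score + (1.0 - alpha) * bert_score


if __name__ == "__main__":
    # Made-up component scores for one (document, image, summary) triple.
    image_summary_score = 0.8      # stand-in for a CLIPScore value
    document_summary_score = 0.6   # stand-in for a BERTScore value
    print(clip_bert_score(image_summary_score, document_summary_score, alpha=0.25))
```

Setting `alpha` to 0 recovers the document-only score and setting it to 1 recovers the image-only score, so the weight controls how much the vision modality contributes to the final factuality estimate.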




