Learning to Evaluate Image Captioning

by   Yin Cui, et al.

Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption level human correlation in Flickr 8k and system level human correlation in COCO. The proposed approach could be served as a learning based evaluation metric that is complementary to existing rule-based metrics.


page 6

page 11

page 12


Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder

Automatically evaluating the quality of image captions can be very chall...

RoViST:Learning Robust Metrics for Visual Storytelling

Visual storytelling (VST) is the task of generating a story paragraph th...

TIGEr: Text-to-Image Grounding for Image Caption Evaluation

This paper presents a new metric called TIGEr for the automatic evaluati...

SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis

The open-ended nature of visual captioning makes it a challenging area f...

WEmbSim: A Simple yet Effective Metric for Image Captioning

The area of automatic image caption evaluation is still undergoing inten...

Can Audio Captions Be Evaluated with Image Caption Metrics?

Automated audio captioning aims at generating textual descriptions for a...

Meta Pattern Concern Score: A Novel Metric for Customizable Evaluation of Multi-classification

Classifiers have been widely implemented in practice, while how to evalu...

Please sign up or login with your details

Forgot password? Click here to reset