On Distinctive Image Captioning via Comparing and Reweighting

by   Jiuniu Wang, et al.

Recent image captioning models are achieving impressive results based on popular metrics, i.e., BLEU, CIDEr, and SPICE. However, focusing on the most popular metrics that only consider the overlap between the generated captions and human annotation could result in using common words and phrases, which lacks distinctiveness, i.e., many similar images have the same caption. In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images. First, we propose a distinctiveness metric – between-set CIDEr (CIDErBtw) to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent based on distinctiveness; however, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions. In contrast, we reweight each ground-truth caption according to its distinctiveness during training. We further integrate a long-tailed weight strategy to highlight the rare words that contain more information, and captions from the similar image set are sampled as negative examples to encourage the generated sentence to be unique. Finally, extensive experiments are conducted, showing that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.


page 2

page 7

page 14

page 16

page 17

page 19

page 20


Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets

A wide range of image captioning models has been developed, achieving si...

Distinctive Image Captioning via CLIP Guided Group Optimization

Image captioning models are usually trained according to human annotated...

Group-based Distinctive Image Captioning with Memory Attention

Describing images using natural language is widely known as image captio...

SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning

Deep neural networks (DNNs) have been recently found popular for image c...

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

The task of image-text matching aims to map representations from differe...

A Hybrid Model for Combining Neural Image Caption and k-Nearest Neighbor Approach for Image Captioning

A hybrid model is proposed that integrates two popular image captioning ...

Improving Audio Captioning Using Semantic Similarity Metrics

Audio captioning quality metrics which are typically borrowed from the m...

Please sign up or login with your details

Forgot password? Click here to reset