Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

by Chia-Wen Kuo, et al.

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption and propose to add an auxiliary input that represents missing information, such as object relationships. Specifically, we propose to mine attributes and relationships from the Visual Genome dataset and to condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this improves grounding. We validate our method on image captioning, perform thorough analyses of each component and of the importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4.
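The retrieval step described in the abstract can be sketched as ranking candidate textual descriptions (e.g., attribute and relationship phrases mined from Visual Genome) by their similarity to the image in a shared embedding space, then conditioning the captioner on the top matches. The paper uses CLIP's image and text encoders for this; the sketch below is a minimal, hypothetical stand-in where precomputed embeddings replace the actual encoders, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def retrieve_topk(image_emb, text_embs, k=3):
    """Rank candidate descriptions by cosine similarity to the image
    embedding and return the indices of the top-k matches."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb           # cosine similarity per description
    return np.argsort(-sims)[:k]           # best-matching description indices

# Toy example: 4 candidate descriptions embedded in a 3-d space.
# In the real system these would be CLIP text embeddings of mined phrases.
descriptions = ["dog chasing ball", "red car", "person riding horse", "dog on grass"]
img = np.array([1.0, 0.1, 0.0])            # stand-in for a CLIP image embedding
txt = np.array([[0.9, 0.2, 0.0],
                [0.0, 1.0, 0.0],
                [0.1, 0.0, 1.0],
                [1.0, 0.0, 0.1]])
top = retrieve_topk(img, txt, k=2)
print([descriptions[i] for i in top])      # the two dog-related phrases win
```

The retrieved phrases would then be fed to the captioning model as the auxiliary textual context described above, alongside the frozen detector's outputs.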



