"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning

Well-formed, context-aware image captions and tags in enterprise content such as marketing material are critical for brand presence and content recall. Creating and updating them manually is non-trivial given the scale of the content and the tedium of the task. We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model, focused on context-assisted image captioning, where the caption is generated from both the image and its context. Our approach aims to overcome the context-independent nature of existing approaches, which treat the image and text independently. We exploit context by pretraining our model on datasets for three tasks: news image captioning, where the news article serves as the context; contextual visual entailment; and keyword extraction from the context. The second pretraining task is a new VL task, for which we construct and release two datasets with 1.1M and 2.2K data instances, respectively. Our system achieves state-of-the-art results, improving the CIDEr score by up to 8.34 on the benchmark news image captioning datasets. To the best of our knowledge, ours is the first effort to incorporate contextual information when pretraining models for VL tasks.
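To make the multi-task setup concrete, below is a minimal, hypothetical sketch of how the three context-aware pretraining tasks could be serialized into unified (image, prompt, target) triples for a sequence-to-sequence VL model such as OFA. The task names, prompt templates, and the `build_pretraining_instance` helper are illustrative assumptions, not the authors' exact data format.

```python
# Hypothetical serialization of the three pretraining tasks into a single
# text-to-text format, as is typical for unified seq2seq VL models like OFA.
# Prompt wording below is an assumption for illustration only.

def build_pretraining_instance(task, image_path, context, target):
    """Serialize one training example as an (image, prompt, target) triple."""
    if task == "news_captioning":
        # Context = full news article; target = the human-written caption.
        prompt = f"Generate a caption for the image given the article: {context}"
    elif task == "visual_entailment":
        # Context = a statement; target = "yes" / "no" / "maybe".
        prompt = f"Does the image entail the statement: {context}?"
    elif task == "keyword_extraction":
        # Text-only task: target = comma-separated keywords from the context.
        prompt = f"Extract keywords from the following text: {context}"
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": image_path, "prompt": prompt, "target": target}


example = build_pretraining_instance(
    task="news_captioning",
    image_path="bridge.jpg",
    context="The mayor opened the new bridge on Sunday.",
    target="The mayor at the bridge opening ceremony.",
)
```

A shared text-to-text interface like this is what lets a single encoder-decoder consume all three tasks during pretraining; only the prompt tells the model which behavior is expected.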




