DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps

by Dongsheng Xu, et al.

Text-based image captioning is an important but under-explored task that aims to generate descriptions containing both visual objects and scene text. Recent studies have made encouraging progress, but they still suffer from an incomplete understanding of scenes and generate inaccurate captions. One possible reason is that current methods construct only plane-level (2D) geometric relationships among scene text, without depth information; the resulting relational reasoning over scene text is insufficient, so models may describe scene text inaccurately. Another possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects and may ignore essential objects entirely, so the scene text belonging to those ignored objects goes unused. To address these issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce semantic features of detected visual object concepts as auxiliary information. DEVICE is thus able to describe scenes more comprehensively and to improve the accuracy of the visual entities it mentions. Extensive experiments demonstrate the effectiveness of DEVICE, which outperforms state-of-the-art models on the TextCaps test set. Our code will be made publicly available.


