Distributed Attention for Grounded Image Captioning

by Nenglun Chen, et al.

We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the context of the image, with each noun grounded to the corresponding region in the image. This task is challenging because no explicit fine-grained region-word alignments are available as supervision. Previous weakly supervised methods mainly explore various regularization schemes to improve attention accuracy. However, their performance is still far from that of fully supervised methods. One main issue that has been ignored is that the attention for generating visually groundable words may focus only on the most discriminative parts and fail to cover the whole object. We term this the partial grounding problem and propose a simple yet effective method to alleviate it. Specifically, we design a distributed attention mechanism that forces the network to aggregate information from multiple spatially different regions with consistent semantics while generating the words. As a result, the union of the attended region proposals should form a visual region that completely encloses the object of interest. Extensive experiments demonstrate the superiority of our proposed method over state-of-the-art methods.
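The idea of taking the union of attended region proposals can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how many attention branches are used, how regions are selected, or how the union is computed. The sketch assumes several attention branches that each score every proposal, takes the top-scoring proposal per branch, and returns the smallest box enclosing the selected proposals. The names `softmax`, `enclosing_box`, and `ground_word` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enclosing_box(boxes: Sequence[Box]) -> Box:
    """Smallest axis-aligned box that contains every box in `boxes`."""
    x1s, y1s, x2s, y2s = zip(*boxes)
    return (min(x1s), min(y1s), max(x2s), max(y2s))


def ground_word(branch_scores: Sequence[Sequence[float]],
                proposals: Sequence[Box]) -> Box:
    """Assumed grounding rule: each branch picks its most-attended
    proposal, and the word is grounded to the box enclosing all picks."""
    picked = []
    for scores in branch_scores:
        weights = softmax(scores)
        best = max(range(len(weights)), key=weights.__getitem__)
        picked.append(proposals[best])
    return enclosing_box(picked)


# Two branches attend to different parts of the same object;
# the third proposal is an unrelated region that neither branch selects.
proposals = [(10, 10, 50, 60), (40, 20, 90, 80), (200, 200, 220, 220)]
branch_scores = [[2.0, 0.5, -1.0], [0.1, 3.0, -2.0]]
print(ground_word(branch_scores, proposals))
```

With a single branch, the grounded region collapses to one proposal, which is the partial grounding behavior described above. With branches that attend to spatially different parts of the same object, the enclosing box can cover more of it.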



Top-Down Viewing for Weakly Supervised Grounded Image Captioning

Weakly supervised grounded image captioning (WSGIC) aims to generate the...

More Grounded Image Captioning by Distilling Image-Text Matching Model

Visual attention not only improves the performance of image captioners, ...

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

With the maturity of visual detection techniques, we are more ambitious ...

Areas of Attention for Image Captioning

We propose "Areas of Attention", a novel attention-based model for autom...

Learning to Generate Grounded Image Captions without Localization Supervision

When generating a sentence description for an image, it frequently remai...

Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition

The goal of unpaired image captioning (UIC) is to describe images withou...

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

Recently, increasing efforts have been focused on Weakly Supervised Scen...
