Learning to Relate from Captions and Bounding Boxes

12/01/2019
by Sarthak Garg, et al.

In this work, we propose a novel approach that predicts the relationships between various entities in an image in a weakly supervised manner, relying on image captions and object bounding box annotations as the sole sources of supervision. Our proposed approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverages the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset, achieving a recall@50 of 15% and a recall@100 of 25% on relation prediction, and show that our model successfully predicts relations that are not present in the corresponding captions.
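The abstract does not specify implementation details, but the core alignment step can be illustrated. The sketch below shows one plausible way to compute a top-down attention alignment between caption entity embeddings and bounding-box region features; all function names, dimensions, and the dot-product scoring are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of top-down attention alignment between caption
# entities and image regions. Names, dimensions, and the scoring
# function are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def align_entities_to_regions(entity_emb, region_feat, proj):
    """Softly align each caption entity to image regions.

    entity_emb:  (num_entities, d_text) caption entity embeddings
    region_feat: (num_regions, d_img) bounding-box region features
    proj:        nn.Linear mapping region features into the text space
    Returns attention weights of shape (num_entities, num_regions).
    """
    regions = proj(region_feat)        # (num_regions, d_text)
    scores = entity_emb @ regions.t()  # dot-product attention scores
    return F.softmax(scores, dim=1)    # normalize over regions per entity

# Usage with random placeholder tensors
torch.manual_seed(0)
entities = torch.randn(3, 256)    # e.g. embeddings for "man", "horse", "field"
regions = torch.randn(10, 2048)   # e.g. detector box features for 10 regions
proj = torch.nn.Linear(2048, 256)
attn = align_entities_to_regions(entities, regions, proj)
print(attn.shape)  # torch.Size([3, 10]); each row sums to 1
```

Under this kind of scheme, the resulting soft alignments could serve as pseudo-labels pairing caption entities with boxes, which the described relation classification network is then trained against.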
