Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding

09/01/2023
by   Joshua Feinglass, et al.
0

Object proposal generation serves as a standard pre-processing step in Vision-Language (VL) tasks (image captioning, visual question answering, etc.). The performance of object proposals generated for VL tasks is currently evaluated across all available annotations, a protocol that we show is misaligned - higher scores do not necessarily correspond to improved performance on downstream VL tasks. Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects. To this end, we propose evaluating object proposals against only a subset of available annotations, selected by thresholding an annotation importance score. Importance of object annotations to VL tasks is quantified by extracting relevant semantic information from text describing the image. We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation when compared against existing techniques. Lastly, we compare current detectors used in the Scene Graph Generation (SGG) benchmark as a use case, which serves as an example of when traditional object proposal evaluation techniques are misaligned.

READ FULL TEXT

page 1

page 3

page 5

page 7

research
09/04/2019

Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Object detection plays an important role in current solutions to vision ...
research
09/22/2019

Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators

Grounding language to visual relations is critical to various language-a...
research
12/09/2021

Injecting Semantic Concepts into End-to-End Image Captioning

Tremendous progress has been made in recent years in developing better i...
research
04/01/2020

More Grounded Image Captioning by Distilling Image-Text Matching Model

Visual attention not only improves the performance of image captioners, ...
research
10/03/2022

Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity

Automatic Audio Captioning (AAC) refers to the task of translating an au...
research
04/30/2018

Improved Image Captioning with Adversarial Semantic Alignment

In this paper we propose a new conditional GAN for image captioning that...
research
08/08/2020

Assisting Scene Graph Generation with Self-Supervision

Research in scene graph generation has quickly gained traction in the pa...

Please sign up or login with your details

Forgot password? Click here to reset