Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions

by Arjun R Akula, et al.

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object and the word order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset, Ref-Adv, by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv




