Image Retrieval from Contextual Descriptions

by Benno Krojer, et al.

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.
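The bi-encoder setup described above reduces to a simple scoring step at retrieval time: embed the description and each of the 10 candidate images separately, then return the candidate with the highest similarity. A minimal NumPy sketch of that step, assuming pre-computed embeddings (the `retrieve` function and the toy vectors below are illustrative, not the paper's code):

```python
import numpy as np

def retrieve(description_emb, candidate_embs):
    """Return the index of the candidate whose embedding has the highest
    cosine similarity to the description embedding (CLIP-style scoring)."""
    # L2-normalize so the dot product equals cosine similarity.
    d = description_emb / np.linalg.norm(description_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ d  # one similarity score per candidate image
    return int(np.argmax(scores))

# Toy usage: 10 random "image" embeddings; candidate 3 is made to
# align closely with the description embedding.
rng = np.random.default_rng(0)
cands = rng.normal(size=(10, 512))
desc = cands[3] + 0.1 * rng.normal(size=512)
print(retrieve(desc, cands))  # → 3
```

In this bi-encoder form each image is encoded independently of the description, which is exactly why minimally contrastive candidates are hard: the fine-grained distinguishing detail must already be present in each image embedding. Cross-encoders such as ViLBERT instead score each (description, image) pair jointly.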




