COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

by Arijit Ray et al.

Compositional reasoning is a hallmark of human visual intelligence; yet despite their scale, large vision-language models struggle to represent simple compositions that combine objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs to adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M-parameter CLIP, which encodes image and language disjointly during pretraining, to perform as well as a 241M-parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal finetuning strategy is a lightweight multi-modal adapter that jointly attends over both the image and language features generated by the pretrained model. We show this works better than common strategies such as prompt-tuning, full fine-tuning, or tuning a comparable number of unimodal layers.
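The core of the adapter described above is cross-modal attention: text token features act as queries over frozen image patch features. The sketch below is a minimal, dependency-free illustration of that mechanism, not the authors' implementation; the function names (`cross_attention`, `softmax`) and the single-head, unprojected formulation are simplifying assumptions for clarity.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query (e.g. a text token feature) attends over all
    keys/values (e.g. frozen image patch features), producing
    one fused output vector per query. This is the kind of
    joint image-language attention a lightweight multi-modal
    adapter adds on top of a frozen pretrained encoder."""
    d = len(keys[0])  # feature dimension, used for score scaling
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        # Output is the attention-weighted mixture of the values.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs
```

In a real adapter these queries, keys, and values would be linear projections of the encoders' outputs, and only the projection weights would be trained while the pretrained CLIP or FLAVA backbone stays frozen.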
