Does language help generalization in vision models?

04/16/2021
by Benjamin Devillers, et al.

Vision models trained on multimodal datasets have recently proved very effective, benefiting both from the wide availability of large image-caption datasets and from the resulting models' ability to generalize to multiple downstream tasks (e.g., zero-shot learning). One might assume that these abilities derive, at least in part, from a "semantic grounding" of the visual feature space, which would learn meaningful structure by mirroring the space of linguistic representations. Contrary to this intuition, we show that a visual model (BiT-M) trained on a very large supervised image dataset (ImageNet-21k) can generalize as well (few-shot learning, unsupervised clustering) as its multimodal counterpart (CLIP). When compared against other standard visual or language models, the latent representations of BiT-M were found to be just as "linguistic" as those of CLIP. Overall, these findings suggest that the main factor driving improvements in generalization in current models is the size of the training dataset, not (solely) multimodal grounding.

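To make the few-shot comparison concrete, below is a minimal sketch of the kind of evaluation described in the abstract: features from a frozen encoder (e.g., CLIP's image tower or BiT-M) are fed to a linear probe trained on only a handful of labeled examples per class. This is an illustrative setup under assumptions, not the authors' exact protocol; the function name `few_shot_probe` and the precomputed feature arrays in the usage comment are hypothetical.

```python
# Minimal sketch (not the paper's exact protocol): measure how well frozen
# features generalize via an n-shot linear probe.
# Assumes features were already extracted with a frozen encoder
# (e.g., CLIP's image encoder or BiT-M) into NumPy arrays.

import numpy as np
from sklearn.linear_model import LogisticRegression


def few_shot_probe(train_feats, train_labels, test_feats, test_labels,
                   n_shot=5, seed=0):
    """Fit a linear probe on n_shot examples per class; return test accuracy."""
    rng = np.random.default_rng(seed)

    # Sample an n-shot support set: n_shot random training examples per class.
    support_idx = []
    for c in np.unique(train_labels):
        class_idx = np.flatnonzero(train_labels == c)
        support_idx.extend(rng.choice(class_idx, size=n_shot, replace=False))
    support_idx = np.array(support_idx)

    # Linear probe on top of the frozen features.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats[support_idx], train_labels[support_idx])
    return clf.score(test_feats, test_labels)


# Hypothetical usage with precomputed feature arrays:
# acc_clip = few_shot_probe(clip_train_feats, y_train, clip_test_feats, y_test)
# acc_bit  = few_shot_probe(bit_train_feats,  y_train, bit_test_feats,  y_test)
# print(f"5-shot probe accuracy  CLIP: {acc_clip:.3f}  BiT-M: {acc_bit:.3f}")
```

In this setup, comparing the two accuracies across several random seeds and values of n_shot gives a simple picture of which frozen feature space generalizes better from few labels.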

research · 06/25/2021 · Multimodal Few-Shot Learning with Frozen Language Models
When trained at sufficient scale, auto-regressive language models exhibi...

research · 04/16/2022 · Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting
We study multimodal few-shot object detection (FSOD) in this paper, usin...

research · 07/03/2023 · Contextual Prompt Learning for Vision-Language Understanding
Recent advances in multimodal learning have resulted in powerful vision-l...

research · 06/04/2023 · Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations
Multimodal embeddings aim to enrich the semantic information in neural r...

research · 02/28/2023 · Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning
Multimodal few-shot learning is challenging due to the large domain gap ...

research · 09/28/2022 · Less is More: Rethinking Few-Shot Learning and Recurrent Neural Nets
The statistical supervised learning framework assumes an input-output se...

research · 04/04/2022 · On Explaining Multimodal Hateful Meme Detection Models
Hateful meme detection is a new multimodal task that has gained signific...
