Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

01/16/2023
by Zhiqiu Lin, et al.

The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
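To make the core idea concrete, below is a minimal sketch of cross-modal few-shot adaptation as the abstract describes it: because CLIP maps images and text into the same embedding space, each class name can simply be appended to the few-shot training set as one extra "one-shot" sample before fitting a linear classifier. This is an illustrative reading of the abstract, not the authors' released code; the prompt template ("a photo of a ..."), the helper names, and the training hyperparameters are assumptions, and it uses OpenAI's `clip` package.

```python
# Sketch: cross-modal few-shot linear probing with CLIP.
# Requires: pip install torch git+https://github.com/openai/CLIP
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_images(pil_images):
    # Map PIL images into CLIP's shared embedding space, L2-normalized.
    batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch).float()
    return F.normalize(feats, dim=-1)

def embed_texts(texts):
    # Map text prompts into the same embedding space, L2-normalized.
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
    return F.normalize(feats, dim=-1)

def cross_modal_linear_probe(few_shot_images, class_names, epochs=100, lr=1e-3):
    # few_shot_images: dict {class index -> list of PIL images}.
    # class_names: list of strings, index-aligned with the class labels.
    feats, labels = [], []
    for idx, images in few_shot_images.items():
        feats.append(embed_images(images))
        labels += [idx] * len(images)
    # Key idea from the paper: each class name becomes one extra
    # one-shot training sample alongside the image embeddings.
    feats.append(embed_texts([f"a photo of a {c}" for c in class_names]))
    labels += list(range(len(class_names)))

    X = torch.cat(feats)                              # (n_img + n_class, d)
    y = torch.tensor(labels, device=device)
    linear = torch.nn.Linear(X.shape[1], len(class_names)).to(device)
    opt = torch.optim.AdamW(linear.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(linear(X), y).backward()
        opt.step()
    return linear  # at test time: logits = linear(embed_images([image]))
```

The same recipe extends beyond vision and language: any encoder that maps another modality (e.g., audio) into the shared space can contribute its embeddings as additional rows of the training matrix, which is how the abstract's audiovisual experiments fit the same framework.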

