Multimodal Few-Shot Learning with Frozen Language Models

06/25/2021
by Maria Tsimpoukelli, et al.

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
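To make the setup concrete, below is a minimal sketch of the idea, not the authors' released code: a small trainable vision encoder maps each image to a handful of continuous embeddings in the frozen language model's token-embedding space, these are prepended to the caption's token embeddings, and the captioning loss is backpropagated through the frozen LM into the vision encoder only. GPT-2 stands in for the (much larger) frozen model, a tiny CNN stands in for the paper's NF-ResNet backbone, and names such as VisionPrefixEncoder and N_PREFIX are illustrative assumptions.

```python
# Sketch of the "frozen LM + trainable vision prefix" setup (illustrative only).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

N_PREFIX = 2  # number of continuous "visual tokens" each image is mapped to


class VisionPrefixEncoder(nn.Module):
    """Map an image to N_PREFIX embeddings in the LM's input-embedding space.
    A tiny CNN stands in for the paper's NF-ResNet visual backbone."""

    def __init__(self, lm_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_prefix = nn.Linear(64, N_PREFIX * lm_dim)
        self.lm_dim = lm_dim

    def forward(self, images):                          # images: (B, 3, H, W)
        feats = self.backbone(images)                   # (B, 64)
        return self.to_prefix(feats).view(-1, N_PREFIX, self.lm_dim)


lm = GPT2LMHeadModel.from_pretrained("gpt2")            # stand-in frozen LM
tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
for p in lm.parameters():                               # freeze the LM entirely
    p.requires_grad_(False)

vision_enc = VisionPrefixEncoder(lm_dim=lm.config.n_embd)
opt = torch.optim.Adam(vision_enc.parameters(), lr=3e-4)


def train_step(images, captions):
    """Captioning loss: gradients flow *through* the frozen LM into the
    vision encoder, which is the only module being updated."""
    ids = tok(captions, return_tensors="pt", padding=True).input_ids
    txt_emb = lm.transformer.wte(ids)                    # frozen token embeddings
    img_prefix = vision_enc(images)                      # (B, N_PREFIX, D)
    inputs = torch.cat([img_prefix, txt_emb], dim=1)     # image prefix + caption
    # Only caption positions contribute to the loss; prefix positions get -100.
    labels = torch.cat(
        [torch.full(img_prefix.shape[:2], -100, dtype=torch.long), ids], dim=1)
    loss = lm(inputs_embeds=inputs, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

At inference time the same trick supports few-shot prompting: several (image prefix, text) pairs can be concatenated into one interleaved sequence of embeddings and fed to the frozen LM, which then completes the final text segment.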

research
04/29/2022

Flamingo: a Visual Language Model for Few-Shot Learning

Building models that can be rapidly adapted to numerous tasks using only...
research
02/28/2023

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Multimodal few-shot learning is challenging due to the large domain gap ...
research
04/16/2021

Does language help generalization in vision models?

Vision models trained on multimodal datasets have recently proved very e...
research
02/02/2023

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Recent progress in scaling up large language models has shown impressive...
research
07/03/2023

Contextual Prompt Learning for Vision-Language Understanding

Recent advances in multimodal learning have resulted in powerful vision-l...
research
09/05/2023

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

Large language models (LLMs) like ChatGPT have revealed amazing intellig...
research
02/03/2023

Controlling for Stereotypes in Multimodal Language Model Evaluation

We propose a methodology and design two benchmark sets for measuring to ...