Flamingo: a Visual Language Model for Few-Shot Learning

04/29/2022
by Jean-Baptiste Alayrac, et al.

Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLMs) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endowing them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question answering, where the model is prompted with a question that it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.
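To make "prompting the model with task-specific examples" concrete, the sketch below assembles a few-shot prompt in which images and text are interleaved. It is only an illustration of the idea, not the authors' code: the `Shot` class, the `build_few_shot_prompt` function, the `<image>` placeholder string, and the file names are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image sits in the text stream.
IMAGE_TOKEN = "<image>"


@dataclass
class Shot:
    """One annotated support example (names are illustrative assumptions)."""
    image: str  # path or identifier of the support image
    text: str   # annotation paired with it, e.g. a caption or "Q: ... A: ..."


def build_few_shot_prompt(
    shots: List[Shot], query_image: str, query_text: str
) -> Tuple[str, List[str]]:
    """Interleave support (image, text) pairs, then append the query.

    Returns the text prompt with one placeholder per image, plus the
    images in the order their placeholders appear.
    """
    parts: List[str] = []
    images: List[str] = []
    for shot in shots:
        parts.append(f"{IMAGE_TOKEN}{shot.text}")
        images.append(shot.image)
    # The query reuses the same format but leaves the answer open.
    parts.append(f"{IMAGE_TOKEN}{query_text}")
    images.append(query_image)
    return "\n".join(parts), images


shots = [
    Shot("cat.jpg", "Output: A cat sleeping on a sofa."),
    Shot("dog.jpg", "Output: A dog catching a ball."),
]
prompt, images = build_few_shot_prompt(shots, "bird.jpg", "Output:")
print(prompt)
```

A model that accepts interleaved inputs would then be asked to continue the text after the final `Output:`. No weights are updated; the task is specified entirely by the support examples in the prompt.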


research
06/25/2021

Multimodal Few-Shot Learning with Frozen Language Models

When trained at sufficient scale, auto-regressive language models exhibi...
research
05/22/2022

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that ca...
research
02/28/2023

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Multimodal few-shot learning is challenging due to the large domain gap ...
research
02/19/2023

Few-shot Multimodal Multitask Multilingual Learning

While few-shot learning as a transfer learning paradigm has gained signi...
research
06/02/2023

Probabilistic Adaptation of Text-to-Video Models

Large text-to-video models trained on internet-scale data have demonstra...
research
12/28/2017

Learning Rapid-Temporal Adaptations

A hallmark of human intelligence and cognition is its flexibility. One o...
research
07/06/2023

Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?

Numerous benchmarks for Few-Shot Learning have been proposed in the last...
