MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning

12/09/2021
by   Constantin Eichenberg, et al.

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA — a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
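The core idea — keeping the pretrained language model frozen and training only small adapter modules inserted into its layers — can be sketched as follows. This is a minimal illustration of the general bottleneck-adapter pattern (down-projection, nonlinearity, up-projection, residual connection), not MAGMA's exact architecture; the `Adapter` and `AdaptedBlock` classes and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    added back to the input via a residual connection. A common
    adapter design; MAGMA's exact layout may differ."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a (frozen) transformer block and applies a trainable
    adapter to its output."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# Toy stand-in for a pretrained decoder stack (two feed-forward blocks).
dim = 32
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(2)]
)

# Freeze all pretrained weights, then wrap each block with an adapter.
for p in blocks.parameters():
    p.requires_grad = False
adapted = nn.ModuleList([AdaptedBlock(b, dim) for b in blocks])

# Only adapter parameters remain trainable; the language model itself
# is untouched, preserving its pretrained knowledge.
trainable = [n for n, p in adapted.named_parameters() if p.requires_grad]
```

In the full method, visual input enters the frozen language model as a sequence of image-prefix embeddings, and the single language-modeling loss backpropagates only into the adapters and the visual encoder.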


