Retrieval-Augmented Multimodal Language Modeling

by Michihiro Yasunaga, et al.

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant knowledge fetched by a retriever from external memory (e.g., multimodal documents on the web). Specifically, we implement a retriever using the pretrained CLIP model and a generator using the CM3 Transformer architecture, and train this model using the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate mixtures of text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities such as knowledge-intensive image generation and multimodal in-context learning.
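The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Document` structure, the helper names, and the three-dimensional toy vectors are all hypothetical stand-ins. In the actual system the embeddings would come from a pretrained CLIP encoder and the assembled input would be fed to the CM3 generator; here the sketch only ranks memory entries by cosine similarity and prepends the top matches to the prompt.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """A multimodal document in external memory (hypothetical structure)."""
    caption: str
    image_id: str            # placeholder for the image content
    embedding: List[float]   # toy stand-in for a CLIP-style embedding


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: List[float], memory: List[Document], k: int = 2) -> List[Document]:
    """Return the k memory documents most similar to the query embedding."""
    ranked = sorted(memory, key=lambda d: cosine(query_embedding, d.embedding), reverse=True)
    return ranked[:k]


def build_generator_input(prompt: str, retrieved: List[Document]) -> str:
    """Prepend retrieved documents to the prompt as context for a generator."""
    context = [f"<img:{d.image_id}> {d.caption}" for d in retrieved]
    return "\n".join(context + [prompt])


# Toy external memory; the embeddings are made up for illustration.
memory = [
    Document("Eiffel Tower at night", "img_001", [0.9, 0.1, 0.0]),
    Document("A bowl of ramen", "img_002", [0.0, 0.2, 0.9]),
    Document("Paris skyline with the Eiffel Tower", "img_003", [0.8, 0.3, 0.1]),
]

query_embedding = [1.0, 0.2, 0.0]  # assumed embedding of the prompt
retrieved = retrieve(query_embedding, memory, k=2)
generator_input = build_generator_input("A photo of the Eiffel Tower in snow", retrieved)
print(generator_input)
```

The point of the design is modularity: the memory can be extended with new documents without retraining the generator, which only needs to learn to make use of whatever context is placed in front of the prompt.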




MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text

While language models store a massive amount of world knowledge implicit...

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Research on text-to-image generation has witnessed significant progress ...

Retrieving Multimodal Information for Augmented Generation: A Survey

In this survey, we review methods that retrieve multimodal knowledge to ...

Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Augmenting pretrained language models (LMs) with a vision encoder (e.g.,...

Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling

This paper studies multi-task training of retrieval-augmented generation...

DataComp: In search of the next generation of multimodal datasets

Large multimodal datasets have been instrumental in recent breakthroughs...

The Web Can Be Your Oyster for Improving Large Language Models

Large language models (LLMs) encode a large amount of world knowledge. H...
