MAGVLT: Masked Generative Vision-and-Language Transformer

03/21/2023
by Sungwoong Kim, et al.

While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data with a single model, rather than generating one fixed modality conditioned on the other. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. In particular, we propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). Compared to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding through parallel token prediction with iterative refinement, and extended editing capabilities such as image and text infilling. To rigorously train MAGVLT from scratch on image-text pairs, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on step-unrolled mask prediction and selective prediction over a mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that MAGVLT outperforms ARGVLT by a large margin while also providing a significant inference speedup. In particular, MAGVLT achieves competitive results on both zero-shot image-to-text and zero-shot text-to-image generation on MS-COCO with a single moderate-sized model (fewer than 500M parameters), even without the use of monomodal data and networks.
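To make the non-autoregressive decoding described above concrete, below is a minimal sketch of MaskGIT-style iterative parallel mask prediction, the general scheme such models build on. All names and constants here (predict_logits, MASK_ID, SEQ_LEN, the cosine schedule, the step count) are illustrative assumptions, not the paper's actual code or hyperparameters.

```python
# Minimal sketch of iterative parallel masked decoding (MaskGIT-style).
# Everything here is illustrative; it is not MAGVLT's released implementation.
import math
import torch

MASK_ID = 0      # hypothetical id of the [MASK] token
SEQ_LEN = 16     # hypothetical target sequence length (image or text tokens)
VOCAB = 1024     # hypothetical token vocabulary size


def predict_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the bidirectional transformer: returns logits for every
    position of the partially masked sequence. Replace with the real model."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)


def iterative_mask_decode(steps: int = 8) -> torch.Tensor:
    # Start fully masked and commit tokens in parallel over `steps` rounds.
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for t in range(steps):
        probs = predict_logits(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)        # per-position confidence and prediction
        masked = tokens == MASK_ID
        # Already-committed positions are never re-masked.
        conf = torch.where(masked, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = math.floor(SEQ_LEN * math.cos(math.pi / 2 * (t + 1) / steps))
        if keep_masked > 0:
            # Re-mask the `keep_masked` lowest-confidence positions, commit the rest.
            threshold = conf.kthvalue(keep_masked, dim=-1, keepdim=True).values
            commit = conf > threshold
        else:
            commit = torch.ones_like(masked)  # final step: commit everything
        tokens = torch.where(masked & commit, pred, tokens)
    return tokens


print(iterative_mask_decode())
```

Because every position is predicted in parallel at each round and only a fixed number of refinement steps is run, decoding cost grows with the number of steps rather than with the sequence length, which is the source of the speedup over autoregressive decoding mentioned above.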

