L-Verse: Bidirectional Generation Between Image and Text

by Taehoon Kim, et al.

Far beyond learning long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector quantized variational autoencoders (VQ-VAEs) are widely used to encode a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between an image (or text) serving as a conditional reference and one serving as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object-detection frameworks. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We further assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results on bidirectional vision-language representation learning in the general domain.
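The abstract's mention of VQ-VAEs refers to the standard tokenization step: an encoder's continuous feature vectors are snapped to their nearest entries in a learned codebook, yielding a sequence of discrete indices a transformer can model. The sketch below illustrates only that nearest-neighbor lookup with toy numbers; the function name, codebook size, and dimensions are illustrative assumptions, not L-Verse's (or AugVAE's) actual implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in VQ-VAE tokenization.

    features: (N, D) array of encoder outputs
    codebook: (K, D) array of learned code vectors
    returns:  (N,) array of discrete token indices
    """
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example: 4 feature vectors near known codebook entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 2))                     # K=3 codes, D=2
features = codebook[[0, 2, 1, 1]] + 0.01 * rng.normal(size=(4, 2))
tokens = quantize(features, codebook)
print(tokens.tolist())  # → [0, 2, 1, 1]
```

In a full VQ-VAE, these indices are what an autoregressive transformer (here, BiART) models as a sequence, and the codebook vectors they select are fed to a decoder to reconstruct the image.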

