Vector-quantized Image Modeling with Improved VQGAN

10/09/2021
by Jiahui Yu, et al.

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer-learning, and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that pretrains a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded by a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN in turn improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256×256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L, which reaches a linear-probe accuracy of 60.3%, at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model size.
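
As a rough illustration of the pipeline the abstract describes, the sketch below turns a small grid of latent vectors into discrete tokens and then into next-token training pairs. It is a minimal sketch under stated assumptions, not the ViT-VQGAN implementation: the function names, the toy codebook, and the plain squared-Euclidean nearest-neighbor lookup are all hypothetical, and the paper's own codebook-learning improvements are not reproduced here.

```python
def rasterize(grid):
    """Flatten a 2D grid (list of rows) into raster order: row by row, left to right."""
    return [cell for row in grid for cell in row]


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Assumption: nearest means smallest squared Euclidean distance.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def next_token_pairs(tokens):
    """Build (context, target) pairs for autoregressive next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy data (hypothetical): a 3-entry codebook and a 2x2 grid of 2-D latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
latent_grid = [
    [[0.1, -0.1], [0.9, 1.2]],
    [[0.2, 0.8], [0.0, 0.1]],
]

tokens = quantize(rasterize(latent_grid), codebook)
pairs = next_token_pairs(tokens)
print(tokens)  # [0, 1, 2, 0]
print(pairs)   # [([0], 1), ([0, 1], 2), ([0, 1, 2], 0)]
```

In this toy setting, a Transformer would be trained to predict each target token from its context; in the paper the encoder producing the latents, the codebook, and the Transformer are all learned.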


research
02/02/2023

Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Recent progress in scaling up large language models has shown impressive...
research
10/24/2022

FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners

Large language models (LLM) trained using the next-token-prediction obje...
research
04/12/2022

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Large pretrained Transformer language models have been shown to exhibit ...
research
09/06/2022

Semantic Image Synthesis with Semantically Coupled VQ-Model

Semantic image synthesis enables control over unconditional image genera...
research
12/06/2022

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Vector-Quantized (VQ-based) generative models usually consist of two bas...
research
06/22/2022

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

We present the Pathways Autoregressive Text-to-Image (Parti) model, whic...
research
12/13/2021

Dependency Learning for Legal Judgment Prediction with a Unified Text-to-Text Transformer

Given the fact of a case, Legal Judgment Prediction (LJP) involves a ser...
