Masked Autoencoders Are Scalable Vision Learners

11/11/2021
by Kaiming He, et al.

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
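For illustration, here is a minimal sketch in PyTorch of the two designs the abstract describes: random 75% patch masking, an encoder that sees only the visible patches, and a lightweight decoder that fills in mask tokens. The class name TinyMAE, the dimensions, the layer counts, and the omission of positional embeddings and of the masked-patches-only loss are simplifications for this sketch; this is not the authors' released implementation.

import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    """Minimal MAE-style sketch: random masking, asymmetric encoder-decoder.
    Dimensions are illustrative; positional embeddings are omitted for brevity."""

    def __init__(self, num_patches=196, patch_dim=768, enc_dim=256, dec_dim=128):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Lightweight decoder: narrower and shallower than the encoder.
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.to_pixels = nn.Linear(dec_dim, patch_dim)

    def random_masking(self, x, mask_ratio=0.75):
        # Shuffle patches per sample and keep only the first (1 - ratio) of them.
        B, N, D = x.shape
        n_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)          # random permutation per sample
        ids_restore = ids_shuffle.argsort(dim=1)    # inverse permutation
        ids_keep = ids_shuffle[:, :n_keep]
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return x_visible, ids_restore, n_keep

    def forward(self, patches, mask_ratio=0.75):
        # Encoder operates only on the visible ~25% of patches (no mask tokens),
        # which is where the 3x+ training speedup comes from.
        x = self.patch_embed(patches)
        x_visible, ids_restore, n_keep = self.random_masking(x, mask_ratio)
        latent = self.encoder(x_visible)

        # Decoder input: encoded visible tokens plus learned mask tokens,
        # unshuffled back to the original patch order.
        dec_tokens = self.enc_to_dec(latent)
        B, N, _ = patches.shape
        masks = self.mask_token.expand(B, N - n_keep, -1)
        full = torch.cat([dec_tokens, masks], dim=1)
        full = torch.gather(
            full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1])
        )
        return self.to_pixels(self.decoder(full))  # reconstructed pixel patches


# Usage: in the paper, the MSE reconstruction loss is computed on the masked
# patches only (not shown here).
model = TinyMAE()
patches = torch.randn(2, 196, 768)  # (batch, 14x14 patches, 16x16x3 pixels each)
recon = model(patches)              # (2, 196, 768)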


Related research

03/30/2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
In this paper, we propose a simple yet powerful improvement over the rec...

03/29/2023
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Scale is the primary factor for building a powerful foundation model tha...

05/21/2022
Improvements to Self-Supervised Representation Learning for Masked Image Modeling
This paper explores improvements to the masked image modeling (MIM) para...

07/31/2022
SdAE: Self-distillated Masked Autoencoder
With the development of generative-based self-supervised learning (SSL) ...

01/14/2022
Time Series Generation with Masked Autoencoder
This paper shows that masked autoencoders with interpolators (InterpoMAE...

12/03/2022
Exploring Stochastic Autoregressive Image Modeling for Visual Representation
Autoregressive language modeling (ALM) has been successfully used in se...

05/28/2022
SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners
Self-supervised Masked Autoencoders (MAE) are emerging as a new pre-trai...
