Efficient Masked Autoencoders with Self-Consistency

by Zhaowen Li, et al.

Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio results in two serious problems: 1) the data are not efficiently exploited, which makes pre-training inefficient (e.g., 1600 epochs for MAE vs. 300 epochs for supervised training), and 2) the pre-trained model shows high uncertainty and inconsistency, i.e., the prediction of the same patch may be inconsistent under different mask rounds. To tackle these problems, we propose efficient masked autoencoders with self-consistency (EMAE), which improve the pre-training efficiency and increase the consistency of MIM. In particular, we progressively divide the image into K non-overlapping parts, each of which is generated by a random mask and has the same mask ratio. The MIM task is then conducted in parallel on all parts within an iteration to generate predictions. In addition, we design a self-consistency module to further maintain the consistency of predictions of overlapping masked patches among parts. Overall, the proposed method exploits the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE with ViT-Base achieves even higher results with only 300 pre-training epochs than MAE does with 1600 epochs. EMAE also consistently obtains state-of-the-art transfer performance on various downstream tasks, such as object detection and semantic segmentation.
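The partitioning and consistency ideas described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of 196 patches with K = 4, the scalar stand-in for a patch prediction, and the mean-absolute-difference form of the consistency term are all assumptions made for illustration. The sketch treats each part as a disjoint set of visible patches, so every part has the same mask ratio of 1 - 1/K, and the masked sets of any two parts overlap.

```python
import random
from itertools import combinations


def partition_patches(num_patches, k, seed=0):
    """Randomly split patch indices into k disjoint, equally sized visible sets."""
    if num_patches % k != 0:
        raise ValueError("num_patches must be divisible by k in this sketch")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    size = num_patches // k
    return [set(order[i * size:(i + 1) * size]) for i in range(k)]


def masked_sets(visible_parts, num_patches):
    """For each part, the masked patches are all patches it cannot see."""
    everything = set(range(num_patches))
    return [everything - visible for visible in visible_parts]


def consistency_loss(predictions, masked):
    """Mean absolute disagreement on patches masked in both parts of a pair.

    predictions[i] maps patch index -> a scalar predicted by part i.
    The scalar is a hypothetical stand-in for a predicted pixel vector,
    and this loss form is an assumption, not taken from the paper.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(masked)), 2):
        for p in masked[i] & masked[j]:
            total += abs(predictions[i][p] - predictions[j][p])
            count += 1
    return total / count if count else 0.0


# Assumed setup: 196 patches (a 14 x 14 grid) and K = 4 parts.
num_patches, k = 196, 4
visible = partition_patches(num_patches, k)
masked = masked_sets(visible, num_patches)
mask_ratio = len(masked[0]) / num_patches  # 1 - 1/K for every part
```

Because the visible sets are disjoint and together cover the image, every patch is used as encoder input exactly once per iteration, which is one way to read the abstract's claim that the data are exploited more efficiently.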



The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models

The computer vision world has been re-gaining enthusiasm in various pre-...

Multi-Level Contrastive Learning for Dense Prediction Task

In this work, we present Multi-Level Contrastive Learning for Dense Pred...

MTSMAE: Masked Autoencoders for Multivariate Time-Series Forecasting

Large-scale self-supervised pre-training Transformer architecture have s...

Exploring Long-Sequence Masked Autoencoders

Masked Autoencoding (MAE) has emerged as an effective approach for pre-t...

RARE: Robust Masked Graph Autoencoder

Masked graph autoencoder (MGAE) has emerged as a promising self-supervis...

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

Recently, significant progress has been made in masked image modeling to...

CAE v2: Context Autoencoder with CLIP Target

Masked image modeling (MIM) learns visual representation by masking and ...
