AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95 faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0 SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs.


page 1

page 4

page 6

page 11

page 12

page 14

page 15


Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Pre-training by numerous image data has become de-facto for robust 2D re...

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Video understanding relies on perceiving the global content and modeling...

Masked Autoencoders As Spatiotemporal Learners

This paper studies a conceptually simple extension of Masked Autoencoder...

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

This paper presents SimVTP: a Simple Video-Text Pretraining framework vi...

MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

In this study, we propose Mixed and Masked Image Modeling (MixMIM), a si...

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Video Foundation Models (VFMs) have received limited exploration due to ...

Generating Individual Trajectories Using GPT-2 Trained from Scratch on Encoded Spatiotemporal Data

Following Mizuno, Fujimoto, and Ishikawa's research (Front. Phys. 2022),...

Please sign up or login with your details

Forgot password? Click here to reset