MGMAE: Motion Guided Masking for Video Masked Autoencoding

08/21/2023
by Bingkun Huang et al.

Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume. Based on this masking volume, we can track the unmasked tokens over time and sample a set of temporally consistent cubes from videos. These temporally aligned unmasked tokens further alleviate the issue of information leakage across time and encourage MGMAE to learn more useful structural information. We implement MGMAE with an efficient online optical flow estimator and a backward warping strategy for the masking maps. We conduct experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating that MGMAE outperforms the original VideoMAE. In addition, we provide visualization analyses illustrating that MGMAE samples temporally consistent cubes in a motion-adaptive manner, enabling more effective video pre-training.
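To make the mechanism concrete, below is a minimal PyTorch sketch of the backward masking-map warping idea, under the assumption of dense per-pixel optical flow. The names warp_mask_backward and build_mask_volume are illustrative rather than the paper's API, and the sketch omits the re-sampling step a full implementation would use to keep the number of visible tokens exactly fixed per frame.

import torch
import torch.nn.functional as F

def warp_mask_backward(mask, flow):
    """Warp the masking map of frame t+1 back to frame t using optical flow.

    mask: (H, W) tensor for frame t+1; 1 = masked, 0 = visible.
    flow: (2, H, W) flow from frame t to frame t+1, (u, v) in pixels.
    Returns the masking map aligned to frame t.
    """
    H, W = mask.shape
    # For each pixel of frame t, look up where it moves to in frame t+1
    # and fetch the mask value from that location (backward warping).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs + flow[0]) / (W - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys + flow[1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    warped = F.grid_sample(mask[None, None].float(), grid,
                           padding_mode="border", align_corners=True)
    return warped[0, 0]

def build_mask_volume(flows, T, H, W, mask_ratio=0.9):
    """Propagate one random masking map across T frames via backward warping.

    flows: (T-1, 2, H, W), flows[t] maps frame t to frame t+1.
    """
    # Start from a random high-ratio mask on the last frame, then warp it
    # backward frame by frame so the same regions stay masked as they move.
    mask = (torch.rand(H, W) < mask_ratio).float()
    volume = [mask]
    for t in range(T - 2, -1, -1):
        mask = (warp_mask_backward(mask, flows[t]) > 0.5).float()
        volume.append(mask)
    return torch.stack(volume[::-1])  # (T, H, W), temporally consistent

Backward warping is the natural direction here: every pixel of frame t fetches its mask value from frame t+1, so the warped map has no holes, which forward splatting of the mask would not guarantee.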

Related research

10/12/2022 · M^3Video: Masked Motion Modeling for Self-Supervised Video Representation Learning
We study self-supervised video representation learning that seeks to lea...

03/23/2022 · VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is general...

03/29/2023 · VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Scale is the primary factor for building a powerful foundation model tha...

08/12/2021 · Deep Motion Prior for Weakly-Supervised Temporal Action Localization
Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize ...

11/19/2022 · Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection
Self-supervised Video Representation Learning (VRL) aims to learn transf...

12/21/2022 · MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning
Learning effective motion features is an essential pursuit of video repr...

08/24/2023 · Motion-Guided Masking for Spatiotemporal Representation Learning
Several recent works have directly extended the image masked autoencoder...
