OmniMAE: Single Model Masked Pretraining on Images and Videos

06/16/2022
by   Rohit Girdhar, et al.

Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single-modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
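To make the patch-dropping idea in the abstract concrete, here is a minimal sketch of uniform random patch masking, where an image is treated as a single-frame clip so the same routine covers both modalities. It is not the authors' implementation: the function names, the 224x224 input size, the 16-frame clip length, the 16x16 and 2x16x16 patch sizes, and the 0.9 mask ratio are all assumptions chosen for illustration, and the paper may sample masks differently.

```python
import random


def num_patches(frames, height, width, patch_t, patch_h, patch_w):
    """Count non-overlapping spatio-temporal patches in a (frames, height, width) input."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("input dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


def random_patch_mask(total, mask_ratio, seed=None):
    """Split patch indices 0..total-1 into (kept, masked) by uniform random sampling."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_masked = int(total * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(total), num_masked))
    kept = [i for i in range(total) if i not in masked]
    return kept, sorted(masked)


# Hypothetical input sizes (assumptions, not taken from the abstract).
image_total = num_patches(1, 224, 224, 1, 16, 16)    # image as a one-frame clip
video_total = num_patches(16, 224, 224, 2, 16, 16)   # 16-frame clip, 2x16x16 patches

image_kept, image_masked = random_patch_mask(image_total, 0.9, seed=0)
video_kept, video_masked = random_patch_mask(video_total, 0.9, seed=0)

print(f"image: {len(image_kept)} of {image_total} patches reach the encoder")
print(f"video: {len(video_kept)} of {video_total} patches reach the encoder")
```

In a masked autoencoder, only the kept patches are passed through the encoder, which is why a high mask ratio cuts pretraining compute so sharply; the masked patches serve as reconstruction targets for the decoder.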


