Multiscale Vision Transformers

04/22/2021
by Haoqi Fan, et al.

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers operating on spatially coarse but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model to image classification, where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
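The channel-resolution trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `multiscale_schedule`, the starting values (96 channels on a 56×56 token grid), the number of stages, and the factor-of-two changes per stage are all assumptions chosen for illustration. The sketch only tabulates how channel width grows and spatial resolution shrinks from stage to stage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One scale stage: a channel width and a spatial token grid."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the stage attends over.
        return self.height * self.width


def multiscale_schedule(
    base_channels: int,
    height: int,
    width: int,
    num_stages: int,
    channel_mult: int = 2,    # assumed: channels double at each stage transition
    spatial_stride: int = 2,  # assumed: each spatial side halves at each transition
) -> List[Stage]:
    """Tabulate a hypothetical multiscale pyramid of (channels, resolution)."""
    stages = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        stages.append(Stage(i, c, h, w))
        # Expand channel capacity while reducing spatial resolution.
        c *= channel_mult
        h = max(1, h // spatial_stride)
        w = max(1, w // spatial_stride)
    return stages


if __name__ == "__main__":
    # Illustrative starting point only; not taken from the paper.
    for s in multiscale_schedule(base_channels=96, height=56, width=56, num_stages=4):
        print(f"stage {s.index}: {s.channels:4d} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumed settings, early stages hold many tokens with few channels and later stages hold few tokens with many channels, which is the pyramid shape the abstract describes.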
