MaxViT: Multi-Axis Vision Transformer

04/04/2022
by Zhengzhong Tu, et al.

Transformers have recently gained significant attention in the computer vision community. However, the poor scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element that effectively blends our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, built by repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. We will make the code and models publicly available.
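The two aspects of multi-axis attention can be illustrated with a token-partitioning sketch. Below is a minimal NumPy illustration (not the authors' implementation): blocked local attention groups tokens into non-overlapping P×P windows, while dilated global attention groups tokens into a fixed G×G grid whose members are spaced H/G apart across the feature map. Because each group has a fixed number of tokens, attention cost grows linearly with spatial size. The function names and the single-head, batch-free setup are illustrative assumptions.

```python
import numpy as np

def block_partition(x, p):
    # Local axis: split an (H, W, C) map into non-overlapping p x p windows
    # -> (num_windows, p*p, C); attention runs within each window.
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    # Global axis: split into a fixed g x g grid -> (num_groups, g*g, C);
    # each group gathers tokens spaced H//g apart (a dilated, global view).
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

def attention(tokens):
    # Plain single-head softmax self-attention within each token group.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ tokens

x = np.random.rand(16, 16, 8)                # toy 16x16 feature map, 8 channels
local = attention(block_partition(x, 4))     # attends inside 4x4 windows
global_ = attention(grid_partition(x, 4))    # attends across a dilated 4x4 grid
print(local.shape, global_.shape)            # both (16, 16, 8): 16 groups of 16 tokens
```

The key point is that the group size (`p*p` or `g*g`) is fixed regardless of input resolution, so the quadratic attention cost inside each group stays constant while the number of groups grows linearly with H·W.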


