Global Interaction Modelling in Vision Transformer via Super Tokens

11/25/2021
by Ammarah Farooq, et al.

With the popularity of Transformer architectures in computer vision, the research focus has shifted towards developing computationally efficient designs. Window-based local attention is one of the major techniques adopted in recent works. These methods begin with a very small patch size and small embedding dimensions, and then perform strided convolution (patch merging) to reduce the feature map size and increase the embedding dimensions, thereby forming a pyramidal design similar to a Convolutional Neural Network (CNN). In this work, we investigate local and global information modelling in transformers by presenting a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention. Specifically, a single Super token is assigned to each image window and captures the rich local details of that window. These tokens are then employed for cross-window communication and global representation learning. Hence, most of the learning in the higher layers is independent of the N image patches, and the class embedding is learned based solely on the N/M^2 Super tokens, where M^2 is the window size. In standard image classification on ImageNet-1K, the proposed Super token based transformer (STT-S25) achieves 83.5% accuracy, which is on par with the Swin Transformer (Swin-B), with roughly half the number of parameters (49M) and double the inference throughput. The proposed Super token transformer offers a lightweight and promising backbone for visual recognition tasks.
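To make the token arithmetic in the abstract concrete, the following minimal sketch counts the patch tokens (N), the patches per window (M^2), and the resulting Super tokens (N/M^2), and compares rough pairwise-attention counts. The image size, patch size, and window side used here (224, 8, 7) are illustrative assumptions and are not taken from the abstract. The function names are hypothetical, and the local count assumes that each Super token attends jointly with the patch tokens of its own window, which the abstract does not spell out.

```python
def super_token_counts(image_size, patch_size, window_side):
    """Return (N, M^2, N / M^2) for a square image.

    N   = number of patch tokens
    M^2 = number of patches per window (window_side is M)
    N / M^2 = number of windows, i.e. one Super token per window
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    if patches_per_side % window_side != 0:
        raise ValueError("patch grid must be divisible by window_side")
    n_patches = patches_per_side ** 2
    patches_per_window = window_side ** 2
    n_super = n_patches // patches_per_window
    return n_patches, patches_per_window, n_super


def attention_pair_counts(n_patches, window_side):
    """Rough query-key pair counts for three attention patterns.

    Assumption (not from the abstract): in local attention each window
    holds its M^2 patch tokens plus its single Super token.
    """
    patches_per_window = window_side ** 2
    n_super = n_patches // patches_per_window
    global_pairs = n_patches ** 2                        # full attention over all patches
    local_pairs = n_super * (patches_per_window + 1) ** 2  # attention inside each window
    super_pairs = n_super ** 2                           # attention among Super tokens only
    return global_pairs, local_pairs, super_pairs


if __name__ == "__main__":
    # Illustrative values only: 224x224 image, 8x8 patches, 7x7 windows.
    n, m2, s = super_token_counts(224, 8, 7)
    print(f"N = {n}, M^2 = {m2}, Super tokens = {s}")
    print(attention_pair_counts(n, 7))
```

Under these assumed settings, attention among the Super tokens alone involves far fewer pairs than full attention over all patches, which is consistent with the abstract's point that learning in the higher layers becomes largely independent of N.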

research
11/11/2022

Token Transformer: Can class token help window-based transformer build better long-range interactions?

Compared with the vanilla transformer, the window-based transformer offe...
research
11/21/2022

Vision Transformer with Super Token Sampling

Vision transformer has achieved impressive performance for many vision t...
research
09/11/2023

SparseSwin: Swin Transformer with Sparse Transformer Block

Advancements in computer vision research have put transformer architectu...
research
08/19/2022

Improved Image Classification with Token Fusion

In this paper, we propose a method using the fusion of CNN and transform...
research
06/23/2022

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Humans are remarkably flexible in understanding viewpoint changes due to...
research
01/30/2022

Aggregating Global Features into Local Vision Transformer

Local Transformer-based classification models have recently achieved pro...
research
08/12/2021

Mobile-Former: Bridging MobileNet and Transformer

We present Mobile-Former, a parallel design of MobileNet and Transformer...
