SepViT: Separable Vision Transformer

03/29/2022
by Wei Li, et al.

Vision Transformers have witnessed prevailing success in a series of vision tasks. However, they often require enormous amounts of computation to achieve high performance, which is burdensome to deploy on resource-constrained devices. To address these issues, we draw lessons from depthwise separable convolution and imitate its ideology to design the Separable Vision Transformer, abbreviated as SepViT. SepViT helps to carry out the information interaction within and among the windows via a depthwise separable self-attention. The novel window token embedding and grouped self-attention are employed to model the attention relationship among windows with negligible computational cost and to capture long-range visual dependencies across multiple windows, respectively. Extensive experiments on various benchmark tasks demonstrate that SepViT can achieve state-of-the-art results in terms of the trade-off between accuracy and latency. Among them, SepViT achieves 84.0% top-1 accuracy on ImageNet-1K classification while decreasing the latency by 40% compared with models of similar accuracy (e.g., CSWin, PVTv2). As for the downstream vision tasks, SepViT with fewer FLOPs can achieve 50.4% mIoU on the ADE20K semantic segmentation task, 47.5 AP on the RetinaNet-based COCO detection task, and 48.7 box AP and 43.9 mask AP on the Mask R-CNN-based COCO detection and segmentation tasks.
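The core idea lends itself to a compact sketch. Below is a minimal PyTorch illustration of a depthwise separable self-attention block in the spirit of the abstract: window-local ("depthwise") attention mixes tokens inside each window, a learnable window token summarizes each window, and a cheap ("pointwise") attention over the window tokens exchanges information among windows. The names (`DepthwiseSeparableAttention`, `window_size`) and the use of `nn.MultiheadAttention` are assumptions for illustration, not the authors' reference implementation, and the grouped self-attention variant is omitted for brevity.

```python
# Minimal sketch of depthwise separable self-attention in the spirit of
# SepViT. Illustrative only; not the paper's reference code.
import torch
import torch.nn as nn


class DepthwiseSeparableAttention(nn.Module):
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.ws = window_size
        # Depthwise step: multi-head self-attention restricted to each
        # local window (information interaction *within* a window).
        self.depthwise = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One learnable "window token" summarizes each window.
        self.window_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Pointwise step: attention among the window tokens only
        # (information interaction *among* windows at negligible cost).
        self.pointwise = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) flattened feature map; H, W divisible by window size.
        B, N, C = x.shape
        ws = self.ws
        nH, nW = H // ws, W // ws
        # Partition into non-overlapping windows: (B*nH*nW, ws*ws, C).
        x = x.view(B, nH, ws, nW, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * nH * nW, ws * ws, C)
        # Prepend the shared window token to every window.
        wt = self.window_token.expand(x.size(0), -1, -1)
        x = torch.cat([wt, x], dim=1)
        # Depthwise attention inside each window (window token attends too).
        x, _ = self.depthwise(x, x, x)
        wt, x = x[:, :1], x[:, 1:]
        # Pointwise attention among the window tokens of one image.
        wt = wt.reshape(B, nH * nW, C)
        wt, _ = self.pointwise(wt, wt, wt)
        # Broadcast each globally-mixed window token back into its window.
        x = x + wt.reshape(B * nH * nW, 1, C)
        # Reverse the window partition back to (B, H*W, C).
        x = x.view(B, nH, nW, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        return self.proj(x.reshape(B, N, C))


# Usage on a SepViT-like stage input (shapes are illustrative):
x = torch.randn(2, 56 * 56, 96)
y = DepthwiseSeparableAttention(96, window_size=7)(x, 56, 56)
print(y.shape)  # torch.Size([2, 3136, 96])
```

The key point the sketch captures is the cost structure: full attention inside each small window plus attention over one token per window, rather than quadratic attention over all H*W tokens.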


Related research

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (03/25/2021)
This paper presents a new vision Transformer, called Swin Transformer, t...

Separable Self-attention for Mobile Vision Transformers (06/06/2022)
Mobile vision transformers (MobileViT) can achieve state-of-the-art perf...

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (07/01/2021)
We present CSWin Transformer, an efficient and effective Transformer-bas...

Vision Transformer with Super Token Sampling (11/21/2022)
Vision transformer has achieved impressive performance for many vision t...

EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition (05/29/2022)
Vision Transformer (ViT) has achieved remarkable performance in many vis...

Swin Transformer V2: Scaling Up Capacity and Resolution (11/18/2021)
We present techniques for scaling Swin Transformer up to 3 billion param...

Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems (08/17/2022)
Vision-Transformers are widely used in various vision tasks. Meanwhile, ...
