BiFormer: Vision Transformer with Bi-Level Routing Attention

by   Lei Zhu, et al.

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (, routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at <>.


page 2

page 8

page 10


QuadTree Attention for Vision Transformers

Transformers have been successful in many vision tasks, thanks to their ...

SG-Former: Self-guided Transformer with Evolving Token Reallocation

Vision Transformer has demonstrated impressive success across various vi...

MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition

Vision Transformer and its variants have demonstrated great potential in...

ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

Deep learning (DL) has advanced the field of dense prediction, while gra...

AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows

Recently Transformer has shown good performance in several vision tasks ...

CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector

Due to the success of Bidirectional Encoder Representations from Transfo...

ActiveMLP: An MLP-like Architecture with Active Token Mixer

This paper presents ActiveMLP, a general MLP-like backbone for computer ...

Please sign up or login with your details

Forgot password? Click here to reset