Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

08/20/2021
by Chuhan Wu, et al.

Transformer has achieved great success in NLP. However, the quadratic complexity of its self-attention mechanism makes it inefficient at handling long sequences. Many existing works accelerate Transformers by computing sparse self-attention instead of dense self-attention, usually attending to tokens at fixed positions or to randomly selected tokens. However, manually selected or random tokens may be uninformative for context modeling. In this paper, we propose Smart Bird, an efficient and effective Transformer with learnable sparse attention. In Smart Bird, we first compute a sketched attention matrix with a single-head, low-dimensional Transformer, which aims to find potentially important interactions between tokens. We then sample token pairs based on probability scores derived from the sketched attention matrix to generate different sparse attention index matrices for different attention heads. Finally, we select token embeddings according to the index matrices to form the input of the sparse attention networks. Extensive experiments on six benchmark datasets for different tasks validate the efficiency and effectiveness of Smart Bird in text modeling.
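
The abstract describes a three-step pipeline: a cheap sketched attention map, per-head sampling of token pairs from that map, and sparse attention over the sampled pairs. The PyTorch sketch below illustrates that idea under simplifying assumptions (a single unbatched sequence, a shared value projection, and mean-pooled heads); names such as smart_bird_sketch, w_q, w_k, w_v, and k_per_token are hypothetical and not the authors' implementation.

import torch

def smart_bird_sketch(x, w_q, w_k, w_v, num_heads=4, k_per_token=8):
    """x: (seq_len, d_model) token embeddings for one sequence (illustrative sketch)."""
    seq_len, d_model = x.shape

    # 1) Sketched attention: a single-head, low-dimensional attention map
    #    that cheaply scores potential token-token interactions.
    d_sketch = w_q.shape[1]                       # low sketch dimension, e.g. 16
    q, k = x @ w_q, x @ w_k                       # (seq_len, d_sketch)
    sketch = torch.softmax(q @ k.t() / d_sketch ** 0.5, dim=-1)   # (seq_len, seq_len)

    # 2) Sample k_per_token keys per query from the sketched scores; each head
    #    draws its own samples, yielding head-specific sparse index matrices.
    idx = torch.stack([
        torch.multinomial(sketch, k_per_token, replacement=False)
        for _ in range(num_heads)
    ])                                            # (num_heads, seq_len, k_per_token)

    # 3) Gather the selected token embeddings and run sparse attention:
    #    each query attends only to its sampled keys.
    v = x @ w_v                                   # (seq_len, d_model)
    k_sel, v_sel = x[idx], v[idx]                 # (num_heads, seq_len, k_per_token, d_model)
    scores = torch.einsum('qd,hqkd->hqk', x, k_sel) / d_model ** 0.5
    attn = torch.softmax(scores, dim=-1)
    out = torch.einsum('hqk,hqkd->hqd', attn, v_sel)
    return out.mean(dim=0)                        # (seq_len, d_model); simple head merge (assumption)

# Example usage with random weights:
seq_len, d_model, d_sketch = 128, 64, 16
x = torch.randn(seq_len, d_model)
w_q = torch.randn(d_model, d_sketch)
w_k = torch.randn(d_model, d_sketch)
w_v = torch.randn(d_model, d_model)
print(smart_bird_sketch(x, w_q, w_k, w_v).shape)  # torch.Size([128, 64])

In a full model the per-query scoring in step 3 would use head-specific query/key projections and the head outputs would be concatenated and projected rather than averaged; the sketch keeps only the structure of the three steps.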

Related research

08/20/2021 - Fastformer: Additive Attention Can Be All You Need
Transformer is a powerful model for text understanding. However, it is i...

09/28/2022 - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention
Vision transformer has emerged as a new paradigm in computer vision, sho...

10/14/2020 - DA-Transformer: Distance-aware Transformer
Transformer has achieved great success in the NLP field by composing var...

06/07/2021 - On the Expressive Power of Self-Attention Matrices
Transformer networks are able to capture patterns in data coming from ma...

11/21/2022 - PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism
Existing deep calibrated photometric stereo networks basically aggregate...

05/02/2020 - Quantifying Attention Flow in Transformers
In the Transformer model, "self-attention" combines information from att...

01/30/2022 - Fast Monte-Carlo Approximation of the Attention Mechanism
We introduce Monte-Carlo Attention (MCA), a randomized approximation met...