An Attention Free Transformer

05/28/2021
by Shuangfei Zhai, et al.

We introduce the Attention Free Transformer (AFT), an efficient variant of the Transformer that eliminates the need for dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This new operation has a memory complexity that is linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes. We also introduce AFT-local and AFT-conv, two model variants that exploit locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification), and show that AFT achieves competitive performance on all benchmarks while providing excellent efficiency.
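The core operation described above can be sketched in NumPy. This is a hypothetical minimal version of the AFT-full layer, not the authors' implementation: for each target position t, the keys are shifted by learned position biases w[t], exponentiated, and used to form an element-wise weighted average of the values, which is then gated by a sigmoid of the query. Note there is no T×T attention matrix over the feature dimension, which is the source of the linear memory complexity.

```python
import numpy as np

def aft_full(Q, K, V, w):
    """Minimal sketch of an AFT-full layer (illustrative, not the reference code).

    Q, K, V: (T, d) query/key/value matrices for a sequence of length T.
    w:       (T, T) learned pairwise position biases.

    Y_t = sigmoid(Q_t) * sum_t'(exp(K_t' + w[t, t']) * V_t')
                       / sum_t'(exp(K_t' + w[t, t']))
    """
    T, d = Q.shape
    Y = np.empty((T, d))
    for t in range(T):
        # Combine keys with the position biases for target position t.
        weights = np.exp(K + w[t][:, None])              # (T, d)
        # Element-wise weighted average of the values per feature dimension.
        ctx = (weights * V).sum(axis=0) / weights.sum(axis=0)
        # Gate the pooled context with the query, element-wise.
        Y[t] = (1.0 / (1.0 + np.exp(-Q[t]))) * ctx
    return Y
```

With zero keys and zero biases, the weighted average reduces to a plain mean of the values, gated by sigmoid(Q); the learned biases w are what restore position-dependent, attention-like behavior.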


Related research

07/22/2023  Simple parameter-free self-attention approximation
11/30/2021  AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
06/08/2021  Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight
07/06/2022  MaiT: Leverage Attention Masks for More Efficient Image Transformers
04/12/2021  GAttANet: Global attention agreement for convolutional neural networks
06/14/2023  When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
01/06/2022  TransVPR: Transformer-based place recognition with multi-level attention aggregation
