SMYRF: Efficient Attention using Asymmetric Clustering

by Giannis Daras, et al.

We propose a novel type of balanced clustering algorithm to approximate attention, reducing attention complexity from O(N^2) to O(N log N), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters. SMYRF's biggest advantage is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast-attention methods impose constraints (e.g., queries and keys must share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and report significant memory and speed benefits. Notably, SMYRF-BERT slightly outperforms BERT on GLUE while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention at high resolutions: using a single TPU, we scale attention to 128x128 (16k tokens) and 256x256 (65k tokens) on BigGAN trained on CelebA-HQ.
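As a rough sketch of the core idea, consider LSH-based clustered attention: queries and keys are hashed, sorted by hash score, and split into equal-size groups, and each query attends only to keys in its own group, giving sub-quadratic cost. This is a simplification for illustration only; the function name, the single random-hyperplane hash, and the sort-and-split balancing are my own stand-ins for SMYRF's asymmetric transformations and adaptive balanced-clustering scheme.

```python
import numpy as np

def lsh_clustered_attention(Q, K, V, n_clusters=4, seed=0):
    """Approximate softmax attention by restricting each query to the keys
    that fall in the same LSH-derived cluster (illustrative sketch only)."""
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    # Random-hyperplane LSH: project queries and keys onto one random
    # direction so nearby vectors receive nearby scores.
    h = rng.normal(size=(d,))
    q_score = Q @ h
    k_score = K @ h
    # Balanced clusters: sort by hash score and cut into equal-size groups,
    # so every cluster holds exactly N // n_clusters queries and keys.
    q_order = np.argsort(q_score)
    k_order = np.argsort(k_score)
    out = np.zeros_like(V)
    size = N // n_clusters
    for c in range(n_clusters):
        qi = q_order[c * size:(c + 1) * size]
        ki = k_order[c * size:(c + 1) * size]
        # Dense attention, but only within the cluster: O(size^2) per
        # cluster instead of O(N^2) overall.
        scores = Q[qi] @ K[ki].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[qi] = weights @ V[ki]
    return out
```

With `n_clusters` clusters of size N / n_clusters, each cluster's attention costs O((N / n_clusters)^2), so total work drops from O(N^2) to O(N^2 / n_clusters); choosing the cluster count to grow with N yields the near-linear regimes the paper targets.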




Related research:

- What Does BERT Look At? An Analysis of BERT's Attention
- Memory-efficient Transformers via Top-k Attention
- Linear-Time Self Attention with Codeword Histogram for Efficient Recommendation
- Input-length-shortening and text generation via attention values
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- Translate Reverberated Speech to Anechoic Ones: Speech Dereverberation with BERT
- Exploring the Space of Key-Value-Query Models with Intention
