KDEformer: Accelerating Transformers via Kernel Density Estimation

02/05/2023
by Amir Zandieh, et al.

The dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling; however, naïve exact computation of this mechanism incurs time and memory complexities that are quadratic in the sequence length, hindering the training of long-sequence models. The critical bottlenecks are the computation of the partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can further be utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than exact computation with over a 4× speedup. For ImageNet classification with T2T-ViT, KDEformer shows over an 18× speedup while the accuracy drop is less than 0.5%.
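For intuition, here is a minimal NumPy sketch of the two ideas the abstract describes: the softmax denominator at each query is n times a softmax-kernel density estimate over the keys, and once per-key attention mass can be estimated, the softmax-matrix-times-values product can be approximated by importance-weighted column subsampling. This is not the paper's KDEformer algorithm; the probe-based mass estimate, the function names, and the sample budget m are illustrative assumptions standing in for the actual KDE solver.

```python
import numpy as np

def exact_attention(Q, K, V):
    # Exact dot-product attention: softmax(Q K^T / sqrt(d)) V -- O(n^2) time and memory.
    d = Q.shape[1]
    S = Q @ K.T / np.sqrt(d)                  # n x n score matrix
    S -= S.max(axis=1, keepdims=True)         # numerical stability
    A = np.exp(S)
    D = A.sum(axis=1, keepdims=True)          # partition functions: n * (softmax-kernel KDE at each query)
    return (A / D) @ V

def sampled_attention(Q, K, V, m, rng):
    # Hypothetical subsampling sketch: draw m key/value pairs with probability
    # proportional to a crude per-key attention-mass estimate (a stand-in for a
    # KDE solver), then take a self-normalized importance-weighted softmax.
    n, d = K.shape
    probe = Q[rng.choice(n, size=min(32, n), replace=False)]   # a few probe queries
    mass = np.exp(probe @ K.T / np.sqrt(d)).mean(axis=0)       # rough attention mass per key
    p = mass / mass.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)             # sampled key/value indices
    w = 1.0 / (m * p[idx])                                     # importance weights
    S = Q @ K[idx].T / np.sqrt(d)
    S -= S.max(axis=1, keepdims=True)
    A = np.exp(S) * w                                          # reweighted exponential scores
    return (A / A.sum(axis=1, keepdims=True)) @ V[idx]

rng = np.random.default_rng(0)
n, d = 2048, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
err = np.linalg.norm(sampled_attention(Q, K, V, m=256, rng=rng) - exact_attention(Q, K, V), ord=2)
print("spectral-norm error of the subsampled approximation:", err)
```

The final line measures the error in the spectral norm, mirroring the kind of guarantee stated for KDEformer, although the paper's bound applies to its own KDE-based sampler rather than to this simplified stand-in.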


Related research

Random Feature Attention (03/03/2021)
Transformers are state-of-the-art models for a variety of sequence model...

Rethinking Attention with Performers (09/30/2020)
We introduce Performers, Transformer architectures which can estimate re...

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (05/27/2022)
Transformers are slow and memory-hungry on long sequences, since the tim...

Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding (06/23/2021)
The attention module, which is a crucial component in Transformer, canno...

Combiner: Full Attention Transformer with Sparse Computation Cost (07/12/2021)
Transformers provide a class of expressive architectures that are extrem...

Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation (12/01/2022)
Kernel matrices, as well as weighted graphs represented by them, are ubi...

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (07/17/2023)
Scaling Transformers to longer sequence lengths has been a major problem...
