Simple Hardware-Efficient Long Convolutions for Sequence Modeling

02/13/2023
by Daniel Y. Fu, et al.

State space models (SSMs) achieve high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement for achieving high performance is keeping the convolution kernels smooth, and that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the Long Range Arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2×, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2× faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.
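The two ideas summarized above, directly learning a length-L convolution kernel and keeping it smooth by squashing its weights, can be illustrated with a short sketch. The code below is a minimal illustration rather than the paper's implementation: it assumes PyTorch, uses soft-thresholding as one plausible form of the squashing intervention, and applies the kernel with an O(L log L) FFT convolution. The names squash_kernel, long_conv_fft, and the threshold lam are placeholders introduced here, not taken from the paper.

import torch
import torch.nn.functional as F


def squash_kernel(k, lam=1e-3):
    # Soft-threshold the kernel weights toward zero. This is one plausible
    # form of the "squashing" intervention mentioned in the abstract; the
    # exact operator and the threshold lam are illustrative assumptions.
    return torch.sign(k) * F.relu(k.abs() - lam)


def long_conv_fft(u, k):
    # Convolve input u of shape (batch, L) with a length-L kernel k in
    # O(L log L) time via the FFT convolution theorem, zero-padding to 2L
    # to avoid circular wrap-around.
    L = u.shape[-1]
    n = 2 * L
    u_f = torch.fft.rfft(u, n=n)
    k_f = torch.fft.rfft(k, n=n)
    return torch.fft.irfft(u_f * k_f, n=n)[..., :L]


if __name__ == "__main__":
    torch.manual_seed(0)
    L = 1024
    u = torch.randn(4, L)         # batch of 4 sequences
    k = 0.01 * torch.randn(L)     # directly learned long kernel (a Parameter in practice)
    y = long_conv_fft(u, squash_kernel(k))
    print(y.shape)                # torch.Size([4, 1024])

FlashButterfly's Butterfly decomposition and IO-aware scheduling target the FFT convolution step shown here; the sketch only illustrates the baseline long-convolution computation that those optimizations accelerate.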

