Hyena Hierarchy: Towards Larger Convolutional Language Models

02/21/2023 ∙ by Michael Poli, et al.
Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
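As a rough illustration of the construction described in the abstract, implicitly parameterized long convolutions interleaved with data-controlled (elementwise) gating, the following NumPy sketch applies an order-N Hyena-style recurrence to an input sequence. It is a minimal sketch under stated assumptions, not the paper's implementation: the projection weights and the filter-generating MLP are random stand-ins for trained parameters, and the names implicit_filter, fft_causal_conv, and hyena_operator are illustrative rather than taken from the authors' codebase.

    # Minimal Hyena-style operator sketch (illustrative only; untrained, random weights).
    # Assumptions: dense projections of the input produce N + 1 signals; long
    # convolution filters come from a small MLP over positional features; the
    # convolution is applied via FFT; projections are combined by elementwise gating.
    import numpy as np

    rng = np.random.default_rng(0)

    def implicit_filter(seq_len, dim, hidden=16):
        """Generate a (seq_len, dim) filter from positional features via a tiny MLP."""
        t = np.linspace(0, 1, seq_len)[:, None]                   # positions in [0, 1]
        feats = np.concatenate([np.sin(2 * np.pi * k * t) for k in range(1, 5)], axis=1)
        w1 = rng.normal(size=(feats.shape[1], hidden))             # random stand-in weights
        w2 = rng.normal(size=(hidden, dim))
        h = np.tanh(feats @ w1) @ w2
        return h * np.exp(-np.linspace(0, 4, seq_len))[:, None]    # decaying window

    def fft_causal_conv(u, h):
        """Causal convolution of u with filter h along the sequence axis, via FFT."""
        L = u.shape[0]
        n = 2 * L                                                  # zero-pad to avoid wrap-around
        y = np.fft.irfft(np.fft.rfft(u, n=n, axis=0) * np.fft.rfft(h, n=n, axis=0), n=n, axis=0)
        return y[:L]

    def hyena_operator(u, order=2):
        """Apply an order-`order` Hyena-style recurrence to u of shape (L, D)."""
        L, D = u.shape
        # order + 1 linear projections (random here, standing in for learned ones)
        projs = [u @ rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(order + 1)]
        v, gates = projs[0], projs[1:]
        z = v
        for n in range(order):
            h = implicit_filter(L, D)                              # implicit long-convolution filter
            z = gates[n] * fft_causal_conv(z, h)                   # gate ⊙ (filter * state)
        return z

    if __name__ == "__main__":
        u = rng.normal(size=(1024, 8))                             # (sequence length, width)
        y = hyena_operator(u, order=2)
        print(y.shape)                                             # (1024, 8)

The FFT-based convolution is what gives such an operator its subquadratic O(L log L) scaling in sequence length L, in contrast to the quadratic cost of dense attention noted in the abstract; the implicit filter parameterization keeps the number of parameters independent of sequence length.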

