LongNet: Scaling Transformers to 1,000,000,000 Tokens

07/05/2023
by Jiayu Ding, et al.

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, restricting the maximum sequence length. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between any two tokens in a sequence; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
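The abstract only sketches how dilated attention works, so below is a minimal, single-head NumPy illustration of the idea: split the sequence into segments, keep every r-th token within each segment, and mix several (segment length, dilation) configurations whose sizes grow geometrically. The function names, the toy configuration list, and the simple averaging used to combine configurations are assumptions made for illustration; the paper additionally shifts the sparse pattern across attention heads so every position is covered and weights the combined outputs by their softmax denominators.

```python
# Minimal sketch of dilated attention (illustrative assumptions, not the paper's exact formulation).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, segment_len, dilation):
    """Attention restricted to segments of length `segment_len`, keeping only
    every `dilation`-th token inside each segment (single head, no masking)."""
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, segment_len):
        idx = np.arange(start, min(start + segment_len, n))[::dilation]
        qs, ks, vs = q[idx], k[idx], v[idx]
        scores = qs @ ks.T / np.sqrt(d)   # attention only among the selected tokens
        out[idx] = softmax(scores) @ vs   # scatter results back to their positions
        # Positions skipped by a sparse pattern keep a zero output in this sketch;
        # the paper shifts the selection across heads so all positions are covered.
    return out

def longnet_style_attention(q, k, v, configs=((4, 1), (8, 2), (16, 4))):
    """Mix several (segment_len, dilation) pairs; both values grow geometrically,
    so the attentive field expands while the cost per configuration stays linear."""
    outs = [dilated_attention(q, k, v, w, r) for w, r in configs]
    return np.mean(outs, axis=0)          # simple average; the paper weights by softmax denominators

# Usage: a toy 32-token sequence with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
print(longnet_style_attention(q, k, v).shape)  # (32, 8)
```

Because both the segment length and the dilation rate grow geometrically, the number of attended token pairs stays linear in the sequence length while the receptive field expands exponentially with distance, which is the trade-off the abstract highlights.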


Related research

Staircase Attention for Recurrent Processing of Sequences (06/08/2021)
Attention mechanisms have become a standard tool for sequence modeling t...

Longformer: The Long-Document Transformer (04/10/2020)
Transformer-based models are unable to process long sequences due to the...

Hyena Hierarchy: Towards Larger Convolutional Language Models (02/21/2023)
Recent advances in deep learning have relied heavily on the use of large...

Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens (05/07/2023)
Transformer models are foundational to natural language processing (NLP)...

Cure the headache of Transformers via Collinear Constrained Attention (09/15/2023)
As the rapid progression of practical applications based on Large Langua...

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs (10/06/2022)
Transformer is the cornerstone model of Natural Language Processing (NLP...

Block-Recurrent Transformers (03/11/2022)
We introduce the Block-Recurrent Transformer, which applies a transforme...
