GMAT: Global Memory Augmentation for Transformers

06/05/2020
by Jonathan Berant, et al.

Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the pairwise dot-product attention, which has a large Ω(L^2) memory requirement for sequences of length L, limiting its ability to process long documents. This has recently attracted substantial interest, and multiple approximations have been proposed that reduce the quadratic memory requirement using sparse attention matrices. In this work, we propose to augment sparse Transformer blocks with a dense attention-based global memory of length M (≪ L), which provides an aggregate global view of the entire input sequence to each position. Our augmentation has a manageable O(M·(L+M)) memory overhead and can be seamlessly integrated with prior sparse solutions. Moreover, the global memory can also be used for sequence compression, by representing a long input sequence with the memory representations alone. We empirically show that our method leads to substantial improvement on a range of tasks, including (a) synthetic tasks that require global reasoning, (b) masked language modeling, and (c) reading comprehension.
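To make the mechanism concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It adds M learned memory slots to a single attention layer, lets the memory attend densely to the full [memory; input] sequence (the O(M·(L+M)) term from the abstract), and restricts ordinary positions to the memory plus a local window as a stand-in for whatever sparse attention pattern the base model uses. All names here (GlobalMemoryAttention, mem_len, window) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalMemoryAttention(nn.Module):
    """Single-head attention augmented with M learned global memory slots.

    Memory queries attend densely to all M + L positions, an O(M * (L + M))
    cost; input queries attend to the memory slots plus a local window, so
    each position still receives an aggregate view of the whole sequence
    through the memory.
    """

    def __init__(self, d_model: int, mem_len: int = 8, window: int = 16):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(mem_len, d_model) * d_model ** -0.5)
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model)
        B, L, _ = x.shape
        M = self.mem.shape[0]
        mem = self.mem.unsqueeze(0).expand(B, -1, -1)
        h = torch.cat([mem, x], dim=1)                       # (B, M + L, D)
        q, k, v = self.q(h), self.k(h), self.v(h)

        # Dense attention: memory queries over every memory + input key.
        mem_out = F.scaled_dot_product_attention(q[:, :M], k, v)

        # Sparse attention: input queries see the memory plus a local window.
        idx = torch.arange(L, device=x.device)
        local = (idx[:, None] - idx[None, :]).abs() <= self.window      # (L, L)
        allow_mem = torch.ones(L, M, dtype=torch.bool, device=x.device)
        mask = torch.cat([allow_mem, local], dim=1)                     # (L, M + L)
        tok_out = F.scaled_dot_product_attention(q[:, M:], k, v, attn_mask=mask)

        return torch.cat([mem_out, tok_out], dim=1)          # (B, M + L, D)


# The first mem_len output vectors are the global memory representations;
# they can double as a compressed summary of the whole input sequence.
layer = GlobalMemoryAttention(d_model=64, mem_len=8, window=16)
out = layer(torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 136, 64])
```

Because the first M output vectors summarize the entire input, they can also be used as a compressed representation of the sequence, which is the compression use case described in the abstract.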

Related research

06/20/2020  Memory Transformer
04/17/2020  ETC: Encoding Long and Structured Data in Transformers
12/03/2022  Global memory transformer for processing long documents
08/25/2023  Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers
02/26/2020  Sparse Sinkhorn Attention
05/05/2023  Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation
02/13/2023  Simple Hardware-Efficient Long Convolutions for Sequence Modeling
