FIT: Far-reaching Interleaved Transformers

05/22/2023
by Ting Chen, et al.

We present FIT: a transformer-based architecture with efficient self-attention and adaptive computation. Unlike original transformers, which operate on a single sequence of data tokens, we divide the data tokens into groups, with each group being a shorter sequence of tokens. We employ two types of transformer layers: local layers operate on data tokens within each group, while global layers operate on a smaller set of introduced latent tokens. These layers, comprising the same set of self-attention and feed-forward layers as standard transformers, are interleaved, and cross-attention is used to facilitate information exchange between data and latent tokens within the same group. The attention complexity is O(n^2) locally within each group of size n, but can reach O(L^(4/3)) globally for a sequence of length L. The efficiency can be further enhanced by relying more on global layers that perform adaptive computation using a smaller set of latent tokens. FIT is a versatile architecture and can function as an encoder, diffusion decoder, or autoregressive decoder. We provide initial evidence demonstrating its effectiveness in high-resolution image understanding and generation tasks. Notably, FIT exhibits potential in performing end-to-end training on gigabit-scale data, such as 6400×6400 images, even without specific optimizations or model parallelism.
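To make the interleaving concrete, below is a minimal sketch of one FIT block in PyTorch. It is an illustrative simplification under stated assumptions, not the authors' reference implementation: the class name FITBlock, the latents_per_group parameter, and the use of nn.TransformerEncoderLayer for the local and global layers are all choices made for this sketch. Data tokens are split into groups; a local layer applies self-attention within each group; per-group latent tokens read from their group via cross-attention; a global layer applies self-attention over the latents of all groups; and the updated latents are written back to the data tokens.

```python
# Minimal sketch of one interleaved FIT block (assumed PyTorch rendering,
# not the authors' reference code; norms/dropout placement simplified).
import torch
import torch.nn as nn


class FITBlock(nn.Module):
    """Local self-attention within token groups, global self-attention over
    a small set of latent tokens, with cross-attention in both directions."""

    def __init__(self, dim=256, num_heads=8, latents_per_group=16):
        super().__init__()
        self.latents_per_group = latents_per_group
        # Local transformer layer: self-attention within each group of data tokens.
        self.local_layer = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim, batch_first=True)
        # Global transformer layer: self-attention over the latent tokens of all groups.
        self.global_layer = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=4 * dim, batch_first=True)
        # Cross-attention: latents read from data, then data read the updated latents back.
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned latent tokens shared across groups (hypothetical initialization).
        self.latents = nn.Parameter(torch.randn(latents_per_group, dim) * 0.02)

    def forward(self, x):
        # x: (batch, num_groups, group_size, dim) -- data tokens already split into groups.
        b, g, n, d = x.shape
        x = x.reshape(b * g, n, d)
        x = self.local_layer(x)  # O(n^2) attention within each group

        # Latents read from (cross-attend to) the data tokens of their own group.
        z = self.latents.unsqueeze(0).expand(b * g, -1, -1)
        z = z + self.read(z, x, x, need_weights=False)[0]

        # Global self-attention across the latents of all groups jointly.
        z = z.reshape(b, g * self.latents_per_group, d)
        z = self.global_layer(z)
        z = z.reshape(b * g, self.latents_per_group, d)

        # Data tokens read the globally updated latents back into their group.
        x = x + self.write(x, z, z, need_weights=False)[0]
        return x.reshape(b, g, n, d)


# Example: 1024 data tokens split into 16 groups of 64, model width 256.
tokens = torch.randn(2, 16, 64, 256)
out = FITBlock()(tokens)  # -> (2, 16, 64, 256)
```

As a rough sanity check on the quoted costs: with group size n on the order of L^(1/3), local attention costs about (L/n)·n^2 = L·n ≈ L^(4/3), and global attention over the roughly L/n groups of latents (assuming a small, fixed number of latents per group) costs about (L/n)^2 ≈ L^(4/3) as well, matching the O(L^(4/3)) figure above.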

Related research

Staircase Attention for Recurrent Processing of Sequences (06/08/2021)
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (02/07/2021)
FNet: Mixing Tokens with Fourier Transforms (05/09/2021)
Quantifying Attention Flow in Transformers (05/02/2020)
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (05/12/2023)
Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs (01/07/2021)
FNetAR: Mixing Tokens with Autoregressive Fourier Transforms (07/22/2021)
