FNetAR: Mixing Tokens with Autoregressive Fourier Transforms

by Tim Lou, et al.

In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse uniform-sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction in most Transformer-based time-series prediction models.
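The abstract describes replacing self-attention with a Fourier-transform-based token-mixing step while preserving the causal (autoregressive) property. The paper itself does not spell out the construction here, so the following is a minimal sketch of one plausible way to make FNet-style Fourier mixing causal: apply the usual FFT along the hidden dimension, but replace the sequence-dimension FFT with a lower-triangular-masked DFT matrix so that position t mixes only tokens at positions <= t. The function name and masking scheme are assumptions for illustration, not the authors' exact method.

```python
import numpy as np

def causal_fourier_mix(x):
    """Illustrative causal Fourier token mixing (a sketch, not FNetAR's
    exact layer).

    x: (seq_len, d_model) real-valued token embeddings.
    Returns a real array of the same shape in which the output at
    position t depends only on inputs at positions <= t.
    """
    n, d = x.shape
    # Explicit DFT matrix along the sequence dimension: column j of the
    # identity is transformed, so dft_seq @ v == fft(v).
    dft_seq = np.fft.fft(np.eye(n), axis=0)
    # Lower-triangular mask enforces causality: row t of the mixing
    # matrix only has nonzero entries for columns s <= t.
    causal_dft = np.tril(dft_seq)
    # Mix along the hidden dimension with an ordinary FFT (this does not
    # leak information across time steps), then along the sequence with
    # the masked DFT; keep the real part, as in FNet.
    mixed_hidden = np.fft.fft(x, axis=-1)
    mixed = causal_dft @ mixed_hidden
    return np.real(mixed)
```

As a sanity check on the autoregressive property, perturbing later tokens should leave earlier outputs unchanged, since row t of the masked DFT matrix zeroes out all contributions from positions after t.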

