Limits to Depth Efficiencies of Self-Attention

06/22/2020
by Yoav Levine, et al.

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: empirical signals indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). In this paper, we theoretically study the interplay between depth and width in self-attention, and shed light on the root of the above phenomenon. We invalidate the seemingly plausible hypothesis that widening is as effective as deepening for self-attention, and show that in fact stacking self-attention layers is so effective that it quickly saturates the capacity of the network width. Specifically, we pinpoint a "depth threshold" that is logarithmic in the network width d_x: L_th = log_3(d_x). For networks of depth below the threshold, we establish a double-exponential depth-efficiency of the self-attention operation, while for depths above the threshold we show that depth-inefficiency kicks in. Our predictions strongly accord with extensive empirical ablations in Kaplan et al. (2020), accounting for the different behaviors in the two depth-(in)efficiency regimes. By identifying network width as a limiting factor, our analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
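As a quick illustration of the abstract's threshold formula (not code from the paper), the Python sketch below evaluates L_th = log_3(d_x) for a few hypothetical network widths; the specific width values are assumptions chosen for illustration only.

```python
import math

def depth_threshold(d_x: int) -> float:
    """Depth threshold from the abstract: L_th = log_3(d_x).

    Below this depth, the paper establishes a double-exponential advantage of
    deepening over widening; above it, depth-inefficiency kicks in.
    """
    return math.log(d_x, 3)

# Hypothetical widths chosen for illustration (not values from the paper).
for d_x in (512, 768, 1024, 4096):
    print(f"d_x = {d_x:5d}  ->  L_th = log_3({d_x}) = {depth_threshold(d_x):.2f}")
```

With these illustrative widths, the computed threshold lands in the single-digit depth range, consistent with the abstract's point that stacking self-attention layers saturates the width's capacity quickly.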

Related research

05/09/2021 · Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
After their successful debut in natural language processing, Transformer...

01/27/2023 · On the Connection Between MPNN and Graph Transformer
Graph Transformer (GT) recently has emerged as a new paradigm of graph l...

05/24/2021 · Self-Attention Networks Can Process Bounded Hierarchical Languages
Despite their impressive performance in NLP, self-attention networks wer...

11/29/2020 · Deeper or Wider Networks of Point Clouds with Self-attention?
Prevalence of deeper networks driven by self-attention is in stark contr...

12/04/2018 · Factorized Attention: Self-Attention with Linear Complexities
Recent works have been applying self-attention to various fields in comp...

12/10/2021 · Self-attention Does Not Need O(n^2) Memory
We present a very simple algorithm for attention that requires O(1) memo...

03/05/2021 · Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Attention-based architectures have become ubiquitous in machine learning...
