Approximating How Single Head Attention Learns

by Charlie Snell et al.

Why do models often attend to salient words, and how does this behavior evolve throughout training? We approximate model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word `i` to an output word `o` if they co-occur frequently. Later, the model learns to attend to `i` when the correct output is `o`, because it already knows that `i` translates to `o`. To formalize this, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g. knowing that `i` translates to `o`), and claim that it drives the learning of the attention. This claim is supported by the fact that before the attention mechanism is learned, KTIW can be learned from word co-occurrence statistics, but not the other way around. In particular, when we construct a training distribution that makes KTIW hard to learn, the learning of attention fails, and the model cannot even learn the simple task of copying the input words to the output. Our approximation explains why models sometimes attend to salient words, and it inspires a toy example in which a multi-head attention model overcomes the above hard training distribution by improving learning dynamics rather than expressiveness.
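The first stage described above can be illustrated with a small sketch: under uniform attention, every source word in a sentence pair contributes equally to every output position, so the only signal available for learning word-level translation is how often a source word `i` co-occurs with a target word `o` across sentence pairs. The toy corpus, the `cooc` table, and the `p_translate` helper below are all hypothetical illustrations, not the paper's actual setup:

```python
from collections import Counter, defaultdict

# Hypothetical toy parallel corpus: (source words, target words) pairs.
corpus = [
    (["a", "b"], ["A", "B"]),
    (["b", "c"], ["B", "C"]),
    (["a", "c"], ["A", "C"]),
]

# With uniform attention, the model sees only sentence-level pairings,
# so word-level translation knowledge (KTIW) reduces to co-occurrence
# counts: count(i, o) over all sentence pairs.
cooc = defaultdict(Counter)
for src, tgt in corpus:
    for i in src:
        for o in tgt:
            cooc[i][o] += 1

def p_translate(i, o):
    """Estimate p(o | i) by normalizing i's co-occurrence counts."""
    total = sum(cooc[i].values())
    return cooc[i][o] / total

# "a" co-occurs with "A" in every sentence containing "a", so "A" gets
# the highest estimated translation probability.
print(p_translate("a", "A"))
```

Once these word-level translation estimates are in place, the second stage has a learning signal for attention: attending to `i` is rewarded exactly when the correct output is a word that `i` translates to.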




