On the Importance of Local Information in Transformer Based Models

08/13/2020
by Madhura Pande, et al.

The self-attention module is a key component of Transformer-based models, wherein each token attends to every other token. Recent studies have shown that the attention heads in these models exhibit syntactic, semantic, or local behaviour. Some studies have also identified promise in restricting this attention to be local, i.e., a token attending only to other tokens in a small neighbourhood around it. However, no conclusive evidence exists that such local attention alone is sufficient to achieve high accuracy on multiple NLP tasks. In this work, we systematically analyse the role of locality information in learnt models and contrast it with the role of syntactic information. More specifically, we first perform a sensitivity analysis and show that, at every layer, the representation of a token is much more sensitive to tokens in a small neighbourhood around it than to tokens that are syntactically related to it. We then define an attention bias metric to determine whether a head pays more attention to local tokens or to syntactically related tokens, and show that a larger fraction of heads exhibit a locality bias than a syntactic bias. Having established the importance of local attention heads, we train and evaluate models in which varying fractions of the attention heads are constrained to be local. Such models would be more efficient, as they require fewer computations in the attention layer. We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT datasets (En-De, En-Ru) and demonstrate that the constrained models perform comparably to the unconstrained ones. Through this systematic evaluation we establish that attention in Transformer-based models can be constrained to be local without affecting performance.
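To make the locality constraint concrete, the sketch below shows one common way a single attention head can be restricted to a local window: positions farther than `window` tokens away are masked out before the softmax. This is a minimal illustration under assumed names (`local_attention`, `window`), not the paper's actual implementation or hyperparameters.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window=2):
    # q, k, v: (seq_len, d) tensors for a single attention head.
    # Locality constraint: each token may only attend to tokens
    # within `window` positions on either side of it.
    seq_len, d = q.shape
    scores = q @ k.transpose(0, 1) / d ** 0.5            # (seq_len, seq_len)
    pos = torch.arange(seq_len)
    # True where |i - j| > window, i.e. outside the local neighbourhood.
    mask = (pos[None, :] - pos[:, None]).abs() > window
    scores = scores.masked_fill(mask, float("-inf"))     # block non-local tokens
    weights = F.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ v

# Example: 8 tokens, 16-dim head, window of 2 tokens on each side.
q, k, v = (torch.randn(8, 16) for _ in range(3))
out = local_attention(q, k, v, window=2)
print(out.shape)  # torch.Size([8, 16])
```

Because each row of the masked score matrix has at most 2 * window + 1 finite entries, such a head only needs attention computations over a fixed-size neighbourhood rather than the full sequence, which is the source of the efficiency gain the abstract refers to.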

