On the Locality of Attention in Direct Speech Translation

04/19/2022
by Belen Alastruey, et al.

Transformers have achieved state-of-the-art results across multiple NLP tasks. However, the complexity of the self-attention mechanism scales quadratically with the sequence length, creating an obstacle for tasks involving long sequences, such as those in the speech domain. In this paper, we discuss the usefulness of self-attention for Direct Speech Translation. First, we analyze the layer-wise token contributions in the self-attention of the encoder, unveiling local diagonal patterns. To prove that some attention weights are avoidable, we propose substituting the standard self-attention with an efficient local one, setting the amount of context based on the results of the analysis. With this approach, our model matches the baseline performance while improving efficiency by skipping the computation of the weights that standard attention discards.
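
The local attention described in the abstract restricts each speech frame to a fixed window of neighbors, mirroring the diagonal patterns observed in the encoder. Below is a minimal sketch of such windowed self-attention in PyTorch; the function name local_self_attention and the window parameter w are illustrative assumptions rather than the paper's actual implementation, and the window size would be chosen from the attention analysis.

    import torch
    import torch.nn.functional as F

    def local_self_attention(q, k, v, w):
        # q, k, v: (batch, seq_len, dim); each position attends only to
        # positions within +/- w of itself (a banded attention pattern).
        seq_len = q.size(1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, T, T)
        idx = torch.arange(seq_len, device=q.device)
        outside = (idx[None, :] - idx[:, None]).abs() > w     # True outside the window
        scores = scores.masked_fill(outside, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    # Example: 2 utterances of 100 frames with 64-dim features, window of 16 frames per side.
    x = torch.randn(2, 100, 64)
    out = local_self_attention(x, x, x, w=16)
    print(out.shape)  # torch.Size([2, 100, 64])

Note that this sketch still materializes the full score matrix before masking, so it only illustrates the attention pattern; an efficient implementation (e.g. the banded attention used in Longformer) computes only the scores inside the window, which is where the quadratic cost is actually avoided.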

Related research

04/10/2020
Longformer: The Long-Document Transformer
Transformer-based models are unable to process long sequences due to the...

05/14/2022
Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation
Transformer-based models have been achieving state-of-the-art results in...

10/01/2019
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions
Self-attention has been a huge success for many downstream tasks in NLP,...

01/20/2021
Classifying Scientific Publications with BERT – Is Self-Attention a Feature Selection Method?
We investigate the self-attention mechanism of BERT in a fine-tuning sce...

06/16/2019
Theoretical Limitations of Self-Attention in Neural Sequence Models
Transformers are emerging as the new workhorse of NLP, showing great suc...

10/09/2022
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
Autoregressive Transformers are strong language models but incur O(T) co...

10/07/2019
Why Attention? Analyzing and Remedying BiLSTM Deficiency in Modeling Cross-Context for NER
State-of-the-art approaches of NER have used sequence-labeling BiLSTM as...
