When Can Self-Attention Be Replaced by Feed Forward Layers?

05/28/2020
by   Shucong Zhang, et al.

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability to capture temporal relationships without being limited by the distance between two related events. However, we note that the range of the learned context progressively increases from the lower to the upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence still important for the upper self-attention layers in the encoder of Transformers? To investigate this, we replace these self-attention layers with feed forward layers. In our speech recognition experiments (Wall Street Journal and Switchboard), we indeed observe an interesting result: replacing the upper self-attention layers in the encoder with feed forward layers leads to no performance drop, and even to minor gains. Our experiments offer insights into how self-attention layers process the speech signal, leading to the conclusion that the lower self-attention layers of the encoder encode a sufficiently wide range of inputs, so learning further contextual information in the upper layers is unnecessary.
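
The replacement described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`init_encoder`, `encode`), the single-head attention, the random untrained weights, and the omission of layer normalisation, positional encoding, and the decoder are all assumptions made to keep the example short. It only shows the structural idea: the lower encoder layers mix information across frames with self-attention, while the upper layers use a frame-wise feed forward layer in its place.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over all frames; x has shape (T, d).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ v


def feed_forward(x, w1, w2):
    # Frame-wise two-layer network with ReLU; each frame is processed independently.
    return np.maximum(x @ w1, 0.0) @ w2


def init_encoder(num_layers, num_attention_layers, d_model, d_ff, seed=0):
    # Hypothetical helper: the lowest `num_attention_layers` layers use
    # self-attention, the remaining upper layers use a feed forward layer instead.
    rng = np.random.default_rng(seed)

    def mat(m, n):
        return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

    layers = []
    for i in range(num_layers):
        if i < num_attention_layers:
            mixer = ("attention", (mat(d_model, d_model),
                                   mat(d_model, d_model),
                                   mat(d_model, d_model)))
        else:
            mixer = ("feed_forward", (mat(d_model, d_ff), mat(d_ff, d_model)))
        layers.append({"mixer": mixer,
                       "ff": (mat(d_model, d_ff), mat(d_ff, d_model))})
    return layers


def encode(x, layers):
    # Residual connections only; layer normalisation is left out for brevity.
    for layer in layers:
        kind, params = layer["mixer"]
        if kind == "attention":
            x = x + self_attention(x, *params)
        else:
            x = x + feed_forward(x, *params)
        x = x + feed_forward(x, *layer["ff"])
    return x


if __name__ == "__main__":
    frames = np.random.default_rng(1).normal(size=(5, 8))  # 5 frames, 8 features
    encoder = init_encoder(num_layers=4, num_attention_layers=2, d_model=8, d_ff=16)
    print(encode(frames, encoder).shape)
```

In this sketch the upper layers have no way to look at other frames, so any context they see must already have been gathered by the attention layers below them. That mirrors the conclusion stated in the abstract; whether it holds for a trained model is an empirical question that the sketch does not address.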

