Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

by Menglong Xu, et al.

Recently, several studies have reported that dot-product self-attention (SA) may not be indispensable to state-of-the-art Transformer models. Motivated by the fact that dense synthesizer attention (DSA), which dispenses with dot products and pairwise interactions, has achieved competitive results in many language processing tasks, in this paper we first propose a DSA-based speech recognition model as an alternative to SA. To reduce the computational complexity and improve the performance, we further propose local DSA (LDSA), which restricts the attention scope of DSA to a local range around the current central frame. Finally, we combine LDSA with SA to extract local and global information simultaneously. Experimental results on the Aishell-1 Mandarin speech recognition corpus show that the proposed LDSA-Transformer achieves a character error rate (CER) of 6.49, slightly better than that of the SA-Transformer, while requiring less computation. The proposed combination method not only achieves a CER of 6.18 but also has roughly the same number of parameters and computational complexity as the SA-Transformer. The implementation of the multi-head LDSA is available at
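The core idea above, each frame predicting attention weights over a small local window directly from its own content, with no dot products between frames, can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's exact implementation; the weight matrices `W1`, `W2`, the ReLU nonlinearity, and the zero-padding at the edges are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_dsa(X, W1, W2, context=1):
    """Local dense synthesizer attention (single head), a sketch.

    X:  (T, d) input frames.
    W1: (d, d_hidden); W2: (d_hidden, 2*context+1).
    Each frame synthesizes weights over its 2*context+1 local window
    from its own content alone -- no pairwise frame interactions.
    """
    T, d = X.shape
    win = 2 * context + 1
    # per-frame attention logits over the local window: shape (T, win)
    logits = np.maximum(X @ W1, 0.0) @ W2
    A = softmax(logits, axis=-1)
    # zero-pad so every frame has a full window at the sequence edges
    Xp = np.pad(X, ((context, context), (0, 0)))
    out = np.zeros_like(X)
    for t in range(T):
        window = Xp[t:t + win]   # (win, d) frames centered on t
        out[t] = A[t] @ window   # weighted sum over the local window
    return out
```

Because the window size is a constant independent of the sequence length T, the cost of the attention map is O(T) rather than the O(T^2) of dot-product SA, which is the source of the computational saving claimed in the abstract.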

