State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions

by Kyu J. Han, et al.

Self-attention has been highly successful on many downstream NLP tasks, which has led to exploration of applying it to speech problems as well. Its efficacy in speech applications, however, has not yet been fully realized, since it is challenging to handle highly correlated speech frames in the context of self-attention. In this paper we propose a new neural network model architecture, namely multi-stream self-attention, to address this issue and thus make the self-attention mechanism more effective for speech recognition. The proposed model architecture consists of parallel streams of self-attention encoders; each stream has layers of 1D convolutions with dilated kernels, whose dilation rates are unique to that stream, followed by a self-attention layer. The self-attention mechanism in each stream attends to only one resolution of the input speech frames, and the attentive computation can be more efficient. In a later stage, the outputs from all the streams are concatenated and then linearly projected to the final embedding. By stacking the proposed multi-stream self-attention encoder blocks and rescoring the resultant lattices with neural network language models, we achieve a word error rate of 2.2% on the test-clean dataset of the LibriSpeech corpus, the best number reported thus far on the dataset.
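
The data flow described in the abstract (parallel streams, each with a dilated 1D convolution at its own rate followed by self-attention, then concatenation and a linear projection) can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: all function names are hypothetical, each stream uses a single convolution layer with scalar kernel taps shared across feature dimensions, the attention is single-head with identity query/key/value projections, and there is no normalization, feed-forward layer, or training.

```python
import math
import random


def dilated_conv1d(frames, kernel, dilation):
    """1D convolution over time with zero padding ('same' length output).

    frames: list of T feature vectors, each of length D.
    kernel: list of K scalar taps, shared across feature dims (an assumption).
    dilation: spacing in frames between neighbouring kernel taps.
    """
    T, D, center = len(frames), len(frames[0]), len(kernel) // 2
    out = []
    for t in range(T):
        acc = [0.0] * D
        for k, w in enumerate(kernel):
            src = t + (k - center) * dilation  # dilated tap position
            if 0 <= src < T:  # positions outside the sequence count as zeros
                for d in range(D):
                    acc[d] += w * frames[src][d]
        out.append(acc)
    return out


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V (an assumption)."""
    D = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in frames]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, frames))
                    for d in range(D)])
    return out


def linear(frames, weight):
    """Apply a (D_out x D_in) weight matrix to every frame."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weight] for f in frames]


def multi_stream_encoder(frames, dilations, kernel, proj):
    """One stream per dilation rate; concatenate stream outputs, then project."""
    streams = [self_attention(dilated_conv1d(frames, kernel, rate))
               for rate in dilations]
    concat = [sum((s[t] for s in streams), []) for t in range(len(frames))]
    return linear(concat, proj)


random.seed(0)
T, D, D_out = 12, 4, 6          # toy sizes, chosen only for illustration
dilations = [1, 2, 3]           # one unique dilation rate per stream
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
kernel = [0.25, 0.5, 0.25]
proj = [[random.gauss(0, 0.1) for _ in range(D * len(dilations))]
        for _ in range(D_out)]

out = multi_stream_encoder(frames, dilations, kernel, proj)
print(len(out), len(out[0]))  # 12 6
```

Because each stream sees the input through a different dilation rate, its attention layer operates on one temporal resolution only; the final projection is the only place where the streams are mixed.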




