Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

by Sehoon Kim, et al.

The recently proposed Conformer model has become the de facto backbone for various downstream speech tasks, thanks to its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro- and micro-architecture of Conformer, we propose the Squeezeformer model, which consistently outperforms state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure, which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed by feed-forward modules, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise downsampling layer to sub-sample the input signal. Squeezeformer achieves a state-of-the-art 7.5% word-error-rate on LibriSpeech test-other without external language models, which is 3.1% and 0.6% better than Conformer-CTC models with the same number of FLOPs. Our code is open-sourced and available online.
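A back-of-the-envelope sketch (not the authors' code) of why the Temporal U-Net helps: self-attention over T frames costs on the order of T² · d FLOPs per layer, so halving the temporal resolution in the middle of the network cuts the quadratic term by roughly 4x for those layers. The sequence length and dimension below are illustrative values, not the paper's configuration.

```python
def attention_flops(seq_len: int, dim: int) -> int:
    """Rough FLOP count for one self-attention layer:
    Q·K^T score matrix (seq_len^2 * dim) plus the weighted
    sum over the values (another seq_len^2 * dim)."""
    return 2 * seq_len * seq_len * dim

# Layers running at full temporal resolution vs. layers after the
# 2x temporal downsampling in the middle of the Temporal U-Net.
full = attention_flops(1000, 256)
squeezed = attention_flops(500, 256)

print(full // squeezed)  # -> 4: attention cost drops ~4x in the squeezed stage
```

This is why the downsampled middle stage dominates the savings: feed-forward and convolution modules scale linearly in T, but attention scales quadratically, so it benefits most from the shorter sequence.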



Related research:

FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition

Attention Enhanced Citrinet for Speech Recognition

Deep Sparse Conformer for Speech Recognition

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Romanian Speech Recognition Experiments from the ROBIN Project
