An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

by Xuankai Chang et al.

Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations across various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, reach or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we further explore scenarios in which the pretrained representations are effective, such as cross-language and overlapped speech. The scripts, configurations, and trained models have been released in ESPnet so that the community can reproduce our experiments and improve upon them.


