Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech

06/15/2022
by Jan Lehečka, et al.

In this paper, we present our progress in pretraining Czech monolingual audio transformers on a large dataset containing more than 80 thousand hours of unlabeled speech, and subsequently fine-tuning the models on automatic speech recognition tasks using a combination of in-domain data and almost 6 thousand hours of out-of-domain transcribed speech. We present a wide range of experiments with various fine-tuning setups, evaluated on two public datasets (CommonVoice and VoxPopuli) and one extremely challenging dataset from the MALACH project. Our results show that monolingual Wav2Vec 2.0 models are robust ASR systems that can take advantage of large labeled and unlabeled datasets and compete successfully with state-of-the-art LVCSR systems. Moreover, Wav2Vec models proved to be good zero-shot learners when no training data are available for the target ASR task.

research
06/01/2020

Learning to Recognize Code-switched Speech Without Forgetting Monolingual Speech Recognition

Recently, there has been significant progress made in Automatic Speech R...
research
11/02/2022

Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition

Research on automatic speech recognition (ASR) systems for electrolaryng...
research
12/06/2022

Robust Speech Recognition via Large-Scale Weak Supervision

We study the capabilities of speech processing systems trained simply to...
research
06/15/2022

Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project

Czech is a very specific language due to its large differences between t...
research
09/18/2023

Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

Large self-supervised pre-trained speech models require computationally ...
research
11/13/2021

Prediction of Listener Perception of Argumentative Speech in a Crowdsourced Dataset Using (Psycho-)Linguistic and Fluency Features

One of the key communicative competencies is the ability to maintain flu...
research
01/06/2023

Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition

Despite improvements to the generalization performance of automated spee...
