Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

10/27/2022
by Fernando Lopez, et al.

High-quality data labeling for specific domains is costly and demands considerable human time. In this work, we propose a self-supervised domain adaptation method based on an iterative pseudo-forced alignment algorithm. The produced alignments are used to customize an end-to-end Automatic Speech Recognition (ASR) system and are refined iteratively. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, which is trained on out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively on a corpus of broadcast TV. The process is repeated, either reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance. This score is computed from the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and are then ready for further ASR adaptation. The results obtained on both the Spanish RTVE2022 and CommonVoice databases support the feasibility of using CTC-based systems to perform highly accurate audio-text alignment, domain adaptation, and semi-supervised training of end-to-end ASR.
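The abstract describes aligning text against frame-wise CTC character posteriors, scoring each alignment by a confidence value, and using that value to place the next temporal anchor. The block below is a minimal sketch of those ideas under stated assumptions, not the authors' implementation. It runs a Viterbi forced alignment over the standard CTC trellis. The confidence is assumed to be the mean per-frame log-posterior along the best path, because the abstract does not give the exact path-based formula. It also includes a hypothetical window-expansion step that starts at an anchor and returns the next one. The function names `ctc_forced_align`, `alignment_confidence`, and `align_from_anchor`, the doubling window schedule, and the threshold value are all illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

NEG_INF = float("-inf")


def ctc_forced_align(
    log_probs: Sequence[Sequence[float]], target: Sequence[int], blank: int = 0
) -> Tuple[float, List[int]]:
    """Viterbi forced alignment of `target` over frame-wise CTC log-posteriors.

    Returns the best-path log-score and the token id (blank or character)
    emitted at every frame, or (-inf, []) if the text cannot fit the frames.
    """
    ext = [blank]  # blank-interleaved target: [blank, c1, blank, ..., cL, blank]
    for c in target:
        ext += [c, blank]
    T, S = len(log_probs), len(ext)
    if T == 0:
        return NEG_INF, []
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[-1] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]  # a path may start with a blank...
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]  # ...or with the first character
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance one state, or skip the blank between
            # two different characters (repeated characters need that blank).
            cands = [s, s - 1]
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best, arg = NEG_INF, -1
            for p in cands:
                if p >= 0 and score[t - 1][p] > best:
                    best, arg = score[t - 1][p], p
            if arg >= 0:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = arg
    # A complete path ends on the final blank or on the last character.
    s = max([S - 1, S - 2] if S > 1 else [0], key=lambda e: score[T - 1][e])
    total = score[T - 1][s]
    if total == NEG_INF:
        return NEG_INF, []
    path = []
    for t in range(T - 1, -1, -1):  # follow back-pointers down to frame 0
        path.append(ext[s])
        s = back[t][s]
    return total, path[::-1]


def alignment_confidence(total: float, num_frames: int) -> float:
    """Hypothetical confidence: mean per-frame log-posterior on the best path."""
    return total / num_frames if num_frames else NEG_INF


def align_from_anchor(
    log_probs: Sequence[Sequence[float]], target: Sequence[int], anchor: int,
    init_win: int, max_win: int, threshold: float, blank: int = 0,
) -> Optional[Tuple[float, int]]:
    """Hypothetical iterative step: grow the window that starts at `anchor`
    until the confidence passes `threshold`. Returns (confidence, next_anchor)
    or None if no window of up to `max_win` frames is accepted."""
    win = init_win
    while win <= max_win:
        window = log_probs[anchor:anchor + win]
        total, path = ctc_forced_align(window, target, blank)
        conf = alignment_confidence(total, len(window))
        if conf >= threshold:
            # Next anchor: the frame right after the last emitted character.
            chars = [i for i, tok in enumerate(path) if tok != blank]
            return conf, anchor + (chars[-1] + 1 if chars else 0)
        if anchor + win >= len(log_probs):
            break  # the window already covers the rest of the audio
        win *= 2
    return None


if __name__ == "__main__":
    def frame(tok: int, vocab: int = 3, p: float = 0.9) -> List[float]:
        rest = (1.0 - p) / (vocab - 1)
        return [math.log(p if v == tok else rest) for v in range(vocab)]

    # Toy posteriors over {0: blank, 1: 'a', 2: 'b'}, peaked as: a a _ b _ _
    lp = [frame(k) for k in (1, 1, 0, 2, 0, 0)]
    print(ctc_forced_align(lp, [1, 2]))  # path [1, 1, 0, 2, 0, 0]
    print(align_from_anchor(lp, [1, 2], 0, 2, 8, -0.5))  # accepted at 4 frames
```

On the toy posteriors, a two-frame window can only fit `a b` by forcing `b` onto a frame that favors `a`, so the confidence falls below the threshold and the window doubles to four frames. There the alignment is accepted, and the next anchor lands right after the aligned `b`. Reducing the quantity of text, the other fallback the abstract mentions, is not modelled in this sketch.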



research
01/24/2020

Semi-supervised ASR by End-to-end Self-training

While deep learning based end-to-end automatic speech recognition (ASR) ...
research
06/22/2022

A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data

Automatic Speech Recognition (ASR) has been dominated by deep learning-ba...
research
02/27/2023

Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

We propose an end-to-end ASR system that can be trained on transcribed s...
research
10/30/2020

Joint Masked CPC and CTC Training for ASR

Self-supervised learning (SSL) has shown promise in learning representat...
research
04/21/2021

Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers

This paper proposes a novel label-synchronous speech-to-text alignment t...
research
04/30/2019

Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

Sequence-to-sequence ASR models require large quantities of data to atta...
research
11/28/2016

Who's that Actor? Automatic Labelling of Actors in TV series starting from IMDB Images

In this work, we aim at automatically labeling actors in a TV series. Ra...
