Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR with Internal Language Model Estimation

by Nilaksh Das, et al.

End-to-end ASR models trained on large amounts of data tend to be implicitly biased towards the language semantics of the training data. Internal language model estimation (ILME) has been proposed to mitigate this bias for autoregressive models such as attention-based encoder-decoder and RNN-T. Typically, ILME is performed by modularizing the acoustic and language components of the model architecture, and eliminating the acoustic input to perform log-linear interpolation with the text-only posterior. However, for CTC-based ASR, it is not as straightforward to decouple the model into such acoustic and language components, as CTC log-posteriors are computed in a non-autoregressive manner. In this work, we propose a novel ILME technique for CTC-based ASR models. Our method iteratively masks the audio timesteps to estimate a pseudo log-likelihood of the internal LM by accumulating log-posteriors for only the masked timesteps. Extensive evaluation across multiple out-of-domain datasets reveals that the proposed approach improves WER by up to 9.8% and OOV F1-score by up to 24.6% relative to Shallow Fusion, when only text data from the target domain is available. In the case of zero-shot domain adaptation, with no access to any target domain data, we demonstrate that removing the source domain bias with ILME can still outperform Shallow Fusion to improve WER by up to 9.3% relative.
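The masking procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`estimate_ilm_log_posteriors`, `ilme_score`), the block mask width, the zero mask value, and the toy linear "model" are all assumptions made for the sketch. The idea is to mask blocks of audio frames, re-run the CTC model, and keep only the log-posteriors at the masked positions as a pseudo internal-LM estimate, which is then subtracted (log-linearly) from the acoustic score.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def estimate_ilm_log_posteriors(log_post_fn, feats, mask_width=4, mask_value=0.0):
    """Pseudo internal-LM log-posteriors via iterative time masking.

    log_post_fn: maps (T, D) acoustic features to (T, V) CTC log-posteriors.
    For each block of `mask_width` frames, the block is masked in the input
    and only the log-posteriors at those masked frames are accumulated.
    """
    T = feats.shape[0]
    ilm_rows = []
    for start in range(0, T, mask_width):
        masked = feats.copy()
        masked[start:start + mask_width] = mask_value  # hide this audio block
        out = log_post_fn(masked)                      # (T, V)
        ilm_rows.append(out[start:start + mask_width]) # keep masked frames only
    return np.concatenate(ilm_rows, axis=0)            # (T, V)

def ilme_score(ctc_log_post, ilm_log_post, lam=0.3):
    """Log-linear ILME correction: subtract the scaled internal-LM estimate."""
    return ctc_log_post - lam * ilm_log_post

# Toy stand-in for a CTC model: a fixed linear layer + log-softmax.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))                 # D=8 features -> V=5 output units
toy_model = lambda f: log_softmax(f @ W)

feats = rng.normal(size=(12, 8))            # T=12 frames of features
ilm = estimate_ilm_log_posteriors(toy_model, feats, mask_width=4)
corrected = ilme_score(toy_model(feats), ilm, lam=0.3)
```

In decoding, the corrected score could additionally be interpolated with an external LM trained on target-domain text; here only the bias-removal step is shown.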




Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

The external language models (LM) integration remains a challenging task...

Decoupled Structure for Improved Adaptability of End-to-End Models

Although end-to-end (E2E) trainable automatic speech recognition (ASR) h...

Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation

ASR model deployment environment is ever-changing, and the incoming spee...

Modular Hybrid Autoregressive Transducer

Text-only adaptation of a transducer model remains challenging for end-t...

Internal language model estimation through explicit context vector learning for attention-based encoder-decoder ASR

An end-to-end (E2E) speech recognition model implicitly learns a biased ...

Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models

Attention-based encoder-decoder (AED) models learn an implicit internal ...

Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems

End-2-end (E2E) models have become increasingly popular in some ASR task...
