Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition

by Leyuan Qu et al.

Due to the dynamic nature of human language, automatic speech recognition (ASR) systems need to continuously acquire new vocabulary. Out-of-vocabulary (OOV) words, such as trending words and new named entities, pose problems for modern ASR systems, which require long training times to adapt their large numbers of parameters. Unlike most previous research, which focuses on language-model post-processing, we tackle this problem at an earlier processing level and eliminate the bias in acoustic modeling so that OOV words can be recognized acoustically. We propose generating OOV words with text-to-speech systems and rescaling losses to encourage neural networks to pay more attention to OOV words. Specifically, when fine-tuning a previously trained model on synthetic audio, we either enlarge the classification loss of utterances containing OOV words (sentence level) or rescale the gradients used for back-propagation for the OOV words themselves (word level). To overcome catastrophic forgetting, we also explore combining loss rescaling with model regularization, i.e., L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that simply fine-tune on synthetic audio with EWC, experimental results on the LibriSpeech benchmark reveal that our proposed loss-rescaling approach achieves a significant improvement in recall rate with only a slight decrease in word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall and precision on OOV word recognition. Furthermore, the proposed combination of loss rescaling and weight consolidation can support continual learning of an ASR system.
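The sentence-level variant of the idea above can be sketched in a few lines: per-utterance training losses are multiplied by a factor greater than one whenever the transcript contains an OOV word, so those utterances contribute more to the parameter update. This is a minimal illustration only; the function and parameter names (`rescale_losses`, `alpha`) are hypothetical and not taken from the paper, and the actual scaling factor would be a tuned hyperparameter.

```python
def rescale_losses(losses, transcripts, oov_words, alpha=2.0):
    """Return per-utterance losses, with the loss multiplied by
    `alpha` whenever the transcript contains at least one OOV word.

    losses      -- list of per-utterance loss values (floats)
    transcripts -- list of transcript strings, aligned with `losses`
    oov_words   -- set of lower-cased OOV words to emphasize
    alpha       -- scaling factor (> 1 emphasizes OOV utterances)
    """
    scaled = []
    for loss, text in zip(losses, transcripts):
        words = set(text.lower().split())
        if words & oov_words:        # utterance mentions an OOV word
            scaled.append(loss * alpha)
        else:
            scaled.append(loss)      # in-vocabulary utterance: unchanged
    return scaled
```

For example, `rescale_losses([1.0, 1.0], ["play some jazz", "stream it on spotify"], {"spotify"})` returns `[1.0, 2.0]`. The word-level variant described in the abstract would instead rescale the gradient contribution of only the OOV token positions, which in a framework such as PyTorch is typically done with a per-token weight on the loss or a gradient hook.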



Using Synthetic Audio to Improve The Recognition of Out-Of-Vocabulary Words in End-To-End ASR Systems

Today, many state-of-the-art automatic speech recognition (ASR) systems ...

Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition

Neural sequence-to-sequence systems deliver state-of-the-art performance...

Short-Term Word-Learning in a Dynamically Changing Environment

Neural sequence-to-sequence automatic speech recognition (ASR) systems a...

Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR

Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) sys...

Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

In this paper, we present a series of complementary approaches to improv...

Improvements to Embedding-Matching Acoustic-to-Word ASR Using Multiple-Hypothesis Pronunciation-Based Embeddings

In embedding-matching acoustic-to-word (A2W) ASR, every word in the voca...

The Effectiveness of Applying Different Strategies on Recognition and Recall Textual Password

Using English words as passwords has been a popular topic in the last f...
