Multilingual Adaptation of RNN Based ASR Systems
A large amount of data is required for automatic speech recognition (ASR) systems to achieve good performance. While such data is readily available for languages like English, there exists a long tail of languages with only limited resources. This problem can be mitigated by using data from additional source languages. In this work, we focus on multilingual systems based on recurrent neural networks (RNNs), trained using the Connectionist Temporal Classification (CTC) loss function. Using a multilingual set of acoustic units to train systems jointly on multiple languages poses difficulties: while the same phones share the same symbols across languages, they are pronounced slightly differently because of, e.g., small shifts in tongue position. To address this issue, we previously proposed Language Feature Vectors (LFVs) to train language-adaptive multilingual systems. In this work, we extend this approach by introducing a novel technique, which we call "modulation", for adding LFVs to the network. We evaluated our approach under multiple conditions, showing improvements in both full and low resource settings as well as for grapheme- and phone-based systems.
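The abstract does not spell out the modulation mechanism, but the idea of conditioning hidden activations on an LFV can be illustrated with a minimal, hypothetical sketch: a learned projection maps the LFV to per-unit gains that scale the RNN's hidden features. All names, shapes, and the exact gating function here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, L = 50, 128, 8  # time steps, hidden units, LFV dimension (assumed sizes)

hidden = rng.standard_normal((T, H))   # RNN hidden activations for one utterance
lfv = rng.standard_normal(L)           # language feature vector for this language

# Hypothetical modulation: project the LFV to one gain per hidden unit,
# kept near 1.0 so an all-zero LFV leaves activations unchanged.
W = rng.standard_normal((L, H)) * 0.1
scale = 1.0 + np.tanh(lfv @ W)         # shape (H,), per-unit gains
modulated = hidden * scale             # broadcast over the time axis

print(modulated.shape)
```

The same LFV-derived gains are applied at every time step, so the language conditioning changes how hidden units respond without altering the sequence length or the CTC output layer.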