Towards Robust Waveform-Based Acoustic Models

by Dino Oglic, et al.

We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with a 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).
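The core idea of vicinal risk minimization described above can be illustrated with a minimal sketch: instead of training on each waveform alone (a delta function at the sample), we draw training points from an assumed Gaussian vicinal density centered at it. The sketch below uses additive white Gaussian noise at a random signal-to-noise ratio as one plausible mixture component; the function name, SNR range, and number of draws are illustrative assumptions, not the paper's exact augmentation schemes.

```python
import numpy as np

def vicinal_samples(waveform, num_draws=4, snr_db_range=(5.0, 20.0), rng=None):
    """Draw points from an assumed Gaussian vicinal density around a waveform.

    Each draw adds white Gaussian noise at a randomly chosen SNR (in dB),
    approximating one mixture component of the marginal population density
    in the vicinity of the training sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(waveform ** 2)
    draws = []
    for _ in range(num_draws):
        snr_db = rng.uniform(*snr_db_range)
        # Noise power that yields the sampled SNR relative to the signal.
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
        draws.append(waveform + noise)
    return np.stack(draws)

# Usage: expand one clean waveform into several vicinal training points.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
augmented = vicinal_samples(x, num_draws=4)
print(augmented.shape)  # (4, 16000)
```

Training on such draws, rather than on the clean samples only, is how the empirical density is replaced by a smoothed vicinal one; the paper's actual components are defined implicitly by augmentation schemes targeting spurious correlations in acoustic models.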


