An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models

by   Rahil Parikh, et al.

End-to-end learning models have demonstrated a remarkable capability in performing speech segregation. Despite their wide scope of real-world applications, little is known about the mechanisms they employ to group and consequently segregate individual speakers. Knowing that harmonicity is a critical cue for these networks to group sources, in this work we perform a thorough investigation of Conv-TasNet and DPT-Net to analyze how they perform a harmonic analysis of the input mixture. We conduct ablation studies in which we apply low-pass, high-pass, and band-stop filters of varying pass-bands to empirically identify the harmonics most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities in synthetic mixtures. We find that end-to-end networks are highly unstable and perform poorly when confronted with deformations that are imperceptible to humans. Replacing the encoder in these networks with a spectrogram leads to lower overall performance but much higher stability. This work helps us understand what information these networks rely on for speech segregation, and it exposes two sources of generalization errors. It also pinpoints the encoder as the part of the network responsible for these errors, allowing for a redesign with expert knowledge or transfer learning.
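The filter ablations described above can be illustrated with a minimal sketch. This is not the paper's code; it is a hypothetical example of applying a band-stop filter to a synthetic mixture with SciPy, the kind of deformation used to probe which harmonics a separation network relies on (the function name, cutoff frequencies, and filter order are all assumptions for illustration).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def band_stop(mixture, sr, f_lo, f_hi, order=4):
    """Suppress the [f_lo, f_hi] Hz band of a waveform (band-stop ablation).

    Uses a zero-phase Butterworth filter so the pass-band is left
    essentially untouched while the stop-band is strongly attenuated.
    """
    sos = butter(order, [f_lo, f_hi], btype="bandstop", fs=sr, output="sos")
    return sosfiltfilt(sos, mixture)


# Synthetic two-tone "mixture": a 440 Hz component and a 2 kHz component.
sr = 16000
t = np.arange(sr) / sr
mix = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)

# Ablate the band around 440 Hz; the 2 kHz component should survive.
ablated = band_stop(mix, sr, 300, 600)
```

Feeding such filtered mixtures to a trained separator and measuring the drop in separation quality per removed band indicates which frequency regions the network's grouping depends on.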




