Single-Channel Multi-Speaker Separation using Deep Clustering

07/07/2016
by   Yusuf Isik, et al.

Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, yielding impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model with an enhancement layer that refines the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1 [...] solving the cocktail party problem.
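
The abstract does not spell out how the embeddings are trained. As a rough illustration only, the sketch below assumes the affinity-matching objective usually associated with deep clustering: each time-frequency bin gets an embedding row in V and a one-hot speaker label row in Y, and training minimizes ||V V^T - Y Y^T||_F^2. The function names (`gram`, `frob_sq`, `deep_clustering_loss`, `naive_loss`) and the toy data are hypothetical and are not taken from the paper; the point is that the loss can be evaluated from small Gram matrices without ever forming the N x N affinity matrices.

```python
import math
import random


def gram(A, B):
    """Return A^T B for row-major lists A (N x P) and B (N x Q)."""
    P, Q = len(A[0]), len(B[0])
    out = [[0.0] * Q for _ in range(P)]
    for a_row, b_row in zip(A, B):
        for i, a in enumerate(a_row):
            for j, b in enumerate(b_row):
                out[i][j] += a * b
    return out


def frob_sq(M):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in M for x in row)


def deep_clustering_loss(V, Y):
    """Assumed objective ||V V^T - Y Y^T||_F^2, expanded as
    ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2 (no N x N matrices needed)."""
    return frob_sq(gram(V, V)) - 2.0 * frob_sq(gram(V, Y)) + frob_sq(gram(Y, Y))


def naive_loss(V, Y):
    """Same quantity computed directly over all N^2 pairs of bins."""
    total = 0.0
    for v_i, y_i in zip(V, Y):
        for v_j, y_j in zip(V, Y):
            vv = sum(a * b for a, b in zip(v_i, v_j))
            yy = sum(a * b for a, b in zip(y_i, y_j))
            total += (vv - yy) ** 2
    return total


# Toy data (hypothetical sizes): N bins, D-dimensional embeddings, C speakers.
random.seed(0)
N, D, C = 50, 4, 2
V = []
for _ in range(N):
    row = [random.gauss(0.0, 1.0) for _ in range(D)]
    norm = math.sqrt(sum(x * x for x in row))
    V.append([x / norm for x in row])  # unit-norm embedding per bin
Y = []
for _ in range(N):
    k = random.randrange(C)
    Y.append([1.0 if c == k else 0.0 for c in range(C)])  # one-hot speaker label

fast = deep_clustering_loss(V, Y)
slow = naive_loss(V, Y)
```

Under these assumptions `fast` and `slow` agree up to floating-point error, and the loss is zero when the embeddings reproduce the label affinities exactly (for example, when V is set equal to Y). At inference time the embeddings would be clustered, for instance with k-means, to obtain one mask per speaker; that step is not shown here.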

research
03/29/2018

Cracking the cocktail party problem by multi-beam deep attractor network

While recent progresses in neural network approaches to single-channel s...
research
02/23/2021

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

Recently, the end-to-end approach has been successfully applied to multi...
research
02/20/2020

Wavesplit: End-to-End Speech Separation by Speaker Clustering

We introduce Wavesplit, an end-to-end speech separation system. From a s...
research
04/26/2018

End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

This paper proposes an end-to-end approach for single-channel speaker-in...
research
01/15/2019

Orthonormal Embedding-based Deep Clustering for Single-channel Speech Separation

Deep clustering is a deep neural network-based speech separation algorit...
research
11/29/2019

Improving Voice Separation by Incorporating End-to-end Speech Recognition

Despite recent advances in voice separation methods, many challenges rem...
research
10/31/2022

DiaCorrect: End-to-end error correction for speaker diarization

In recent years, speaker diarization has attracted widespread attention....
