Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

07/06/2023
by Guinan Li, et al.

Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that incorporates visual information into all system components. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation front-end, the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures that perform speech separation and dereverberation either in a pipelined fashion or jointly via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and the ASR back-end is minimized by end-to-end joint fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant speech mixtures constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1 (41.7 ... enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
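
As an illustration of the mask-based MVDR separation step named in the abstract, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes the time-frequency masks are already available (in the paper they would come from a neural network, optionally conditioned on video) and uses the widely used reference-channel MVDR formulation, which may differ from the exact variant in the paper. The function names `masked_psd`, `mvdr_weights` and `apply_beamformer`, and the diagonal loading constant, are hypothetical choices made for this example.

```python
import numpy as np

def masked_psd(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance (PSD) matrix per frequency bin.

    Y:    complex multi-channel STFT, shape (C, F, T)
    mask: real-valued time-frequency mask in [0, 1], shape (F, T)
    Returns an array of shape (F, C, C).
    """
    num = np.einsum("ft,cft,dft->fcd", mask, Y, Y.conj())
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den

def mvdr_weights(psd_s, psd_n, ref=0, diag_load=1e-6):
    """Reference-channel MVDR filter, one weight vector per frequency bin.

    w(f) = Phi_n(f)^-1 Phi_s(f) u_ref / trace(Phi_n(f)^-1 Phi_s(f))
    The diagonal loading is an assumption added for numerical stability.
    """
    C = psd_n.shape[-1]
    load = diag_load * np.trace(psd_n, axis1=-2, axis2=-1).real / C
    psd_n = psd_n + load[:, None, None] * np.eye(C)
    numerator = np.linalg.solve(psd_n, psd_s)            # (F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)      # (F,)
    return numerator[..., ref] / (trace[:, None] + 1e-8) # (F, C)

def apply_beamformer(w, Y):
    """Apply the filter: X(f, t) = w(f)^H y(f, t). Returns shape (F, T)."""
    return np.einsum("fc,cft->ft", w.conj(), Y)

# Usage with random data standing in for a real mixture and estimated masks.
rng = np.random.default_rng(0)
C, F, T = 4, 5, 50
Y = rng.standard_normal((C, F, T)) + 1j * rng.standard_normal((C, F, T))
speech_mask = rng.uniform(size=(F, T))
psd_s = masked_psd(Y, speech_mask)
psd_n = masked_psd(Y, 1.0 - speech_mask)
w = mvdr_weights(psd_s, psd_n, ref=0)
X = apply_beamformer(w, Y)
```

In a joint fine-tuning setup of the kind the abstract describes, a differentiable version of this filter would sit between the mask estimator and the ASR back-end so that the recognition loss can propagate back to the masks; the sketch above only covers the beamforming arithmetic.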


research
04/05/2022

Audio-visual multi-channel speech separation, dereverberation and recognition

Despite the rapid advance of automatic speech recognition (ASR) technolo...
research
04/16/2019

Joined Audio-Visual Speech Enhancement and Recognition in the Cocktail Party: The Tug Of War Between Enhancement and Recognition Losses

In this paper we propose an end-to-end LSTM-based model that performs si...
research
11/04/2020

DESNet: A Multi-channel Network for Simultaneous Speech Dereverberation, Enhancement and Separation

In this paper, we propose a multi-channel network for simultaneous speec...
research
01/06/2020

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Automatic recognition of overlapped speech remains a highly challenging ...
research
04/23/2019

Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge

In this study, we concentrate on replacing the process of extracting han...
research
12/11/2021

Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR

Single-channel speech enhancement approaches do not always improve autom...
research
05/08/2020

Neural Spatio-Temporal Beamformer for Target Speech Separation

Purely neural network (NN) based speech separation and enhancement metho...
