Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

11/08/2022
by Shucong Zhang, et al.

Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real-time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from an enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that gives adaptive target speaker representations. We present extensive experiments and show that our proposed cross-attention approach outperforms competitive baselines consistently, even when our model is only approximately half the size.
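The core idea of the adaptive speaker representation can be sketched as cross-attention: each frame of the noisy mixture acts as a query over the enrolment-audio frames, so the target-speaker representation varies per frame instead of being a single static vector. The following is a minimal, illustrative sketch of that mechanism (not the paper's actual architecture); the function name, shapes, and the use of plain scaled dot-product attention are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_speaker_embedding(mix_frames, enrol_frames):
    """Hypothetical cross-attention sketch.

    mix_frames:   (T_mix, d)   frame features of the noisy input mixture (queries)
    enrol_frames: (T_enrol, d) frame features of the enrolment audio (keys/values)

    Returns a (T_mix, d) array: one adaptive target-speaker representation
    per input frame, rather than one fixed embedding for the whole utterance.
    """
    d_k = enrol_frames.shape[-1]
    scores = mix_frames @ enrol_frames.T / np.sqrt(d_k)   # (T_mix, T_enrol)
    weights = softmax(scores, axis=-1)                    # attention over enrolment frames
    return weights @ enrol_frames                         # (T_mix, d)
```

In a streaming setting, the enrolment frames are fixed ahead of time, so their key/value projections can be precomputed once; only the per-frame query computation runs in real time, which keeps the added latency small.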


Related research

- A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation (11/18/2021): "We present a frontend for improving robustness of automatic speech recog..."
- Visual Speech Enhancement (11/23/2017): "When video is shot in noisy environment, the voice of a speaker seen in ..."
- Iterative autoregression: a novel trick to improve your low-latency speech enhancement model (11/03/2022): "Streaming models are an essential component of real-time speech enhancem..."
- MPEG-H Audio for Improving Accessibility in Broadcasting and Streaming (09/25/2019): "Broadcasting and streaming services still suffer from various levels of ..."
- Real-time Streaming Wave-U-Net with Temporal Convolutions for Multichannel Speech Enhancement (04/05/2021): "In this paper, we describe the work that we have done to participate in ..."
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions (05/17/2023): "Time-domain single-channel speech enhancement (SE) still remains challen..."
- Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks (08/17/2020): "We propose Mobile Audio Streaming Networks (MASnet) for efficient low-la..."
