The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

09/15/2023
by   Shilong Wu, et al.
Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR), with commendable success. However, even the most advanced back-end recognition systems often hit performance limits in complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in the ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges, which primarily focus on simulated data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual cues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into improving the accuracy of back-end speech recognition systems through AVTSE in challenging, real acoustic environments. This paper gives a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the difficulties participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
