Bio-Inspired Modality Fusion for Active Speaker Detection

02/28/2020
by Gustavo Assunção, et al.

Human beings have developed remarkable abilities to integrate information from multiple sensory sources by exploiting their inherent complementarity. Perceptual capabilities are thereby heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e. speech disambiguation from a panoply of sound signals. This fusion ability is also key to refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has identified the superior colliculus region of the brain as the one responsible for this modality fusion, and a handful of biological models have been proposed to describe its underlying neurophysiological process. Drawing inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability has a wide range of applications, from teleconferencing systems to social robotics. The detection approach first routes auditory and visual information through two specialized neural network structures. The resulting embeddings are then fused via a novel layer based on the superior colliculus, whose topological structure emulates the spatial neuron cross-mapping of unimodal perceptual fields. Validation employed two publicly available datasets, and the results achieved confirmed and greatly surpassed initial expectations.


research
12/01/2022

Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Active speaker detection in videos addresses associating a source face, ...
research
02/23/2021

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Estimating the positions of multiple speakers can be helpful for tasks l...
research
06/09/2022

Audio-video fusion strategies for active speaker detection in meetings

Meetings are a common activity in professional contexts, and it remains ...
research
06/07/2021

Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion

It is now well established from a variety of studies that there is a sig...
research
09/01/2021

FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

The strong relation between face and voice can aid active speaker detect...
research
09/28/2021

Neural Dependency Coding inspired Multimodal Fusion

Information integration from different modalities is an active area of r...
research
05/10/2022

Spike-based computational models of bio-inspired memories in the hippocampal CA3 region on SpiNNaker

The human brain is the most powerful and efficient machine in existence ...
