Multistage linguistic conditioning of convolutional layers for speech emotion recognition

In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage method where the streams are merged at a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and on using them to condition one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Our accompanying analysis further reveals a hitherto unexplored role of the underlying dialogue acts in unimodal and bimodal SER, with different models showing biased behaviour across different acts. Overall, our multistage fusion shows better quantitative performance, surpassing all alternatives on most of our evaluations. This illustrates the potential of multistage fusion for better assimilating text and audio information.
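The difference between single-stage and multistage fusion can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the conditioning operator, so a FiLM-style per-channel scale and shift is assumed here; the 1x1 convolution stands in for a real convolutional block; and the function names, channel sizes, and random inputs are all hypothetical. The only fixed number is the 768-dimensional embedding, which matches the hidden size of BERT-base.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_block(x, weight):
    """Stand-in for a convolutional block: 1x1 convolution (channel mixing) + ReLU.

    x: (C_in, T, F) feature map, weight: (C_out, C_in).
    """
    return np.maximum(np.einsum("oc,ctf->otf", weight, x), 0.0)

def condition(x, text_emb, proj_scale, proj_shift):
    """Assumed FiLM-style conditioning: per-channel scale and shift from the text embedding."""
    scale = proj_scale @ text_emb  # (C,)
    shift = proj_shift @ text_emb  # (C,)
    return x * (1.0 + scale[:, None, None]) + shift[:, None, None]

def forward(spec, text_emb, conv_weights, projections, fuse_at):
    """Run the audio stack, injecting the text embedding after the blocks listed in fuse_at."""
    x = spec
    for i, weight in enumerate(conv_weights):
        x = conv_block(x, weight)
        if i in fuse_at:
            proj_scale, proj_shift = projections[i]
            x = condition(x, text_emb, proj_scale, proj_shift)
    return x.mean(axis=(1, 2))  # global average pooling -> (C_last,)

# Hypothetical sizes: 1 input channel, 100 frames, 64 Mel bins.
channels = [1, 8, 16, 32]
emb_dim = 768  # hidden size of BERT-base

spec = rng.normal(size=(1, 100, 64))   # dummy log-Mel spectrogram
text_emb = rng.normal(size=emb_dim)    # dummy summary linguistic embedding

conv_weights = [
    rng.normal(size=(c_out, c_in)) * 0.1
    for c_in, c_out in zip(channels[:-1], channels[1:])
]
projections = [
    (rng.normal(size=(c_out, emb_dim)) * 0.01,
     rng.normal(size=(c_out, emb_dim)) * 0.01)
    for c_out in channels[1:]
]

audio_only = forward(spec, text_emb, conv_weights, projections, fuse_at=set())
single = forward(spec, text_emb, conv_weights, projections, fuse_at={2})
multistage = forward(spec, text_emb, conv_weights, projections, fuse_at={0, 1, 2})
```

In this sketch the only difference between the three variants is the `fuse_at` set: empty for the audio-only model, one block index for single-stage fusion, and every block index for multistage fusion. A late-fusion baseline would instead combine the outputs of separately trained audio and text models and is not shown.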


