Hybrid Fusion Based Interpretable Multimodal Emotion Recognition with Insufficient Labelled Data

08/24/2022
by Puneet Kumar, et al.

This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), to classify the emotions reflected by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important visual, spoken, and textual features leading to the prediction of a particular emotion class. VISTA Net fuses the information from the image, speech, and text modalities using a hybrid of early and late fusion; it automatically adjusts the weights of their intermediate outputs while computing their weighted average, without human intervention. The KAAP technique computes the contribution of each modality, and of its constituent features, toward predicting a particular emotion class. To mitigate the scarcity of multimodal emotion datasets labeled with discrete emotion classes, we have constructed the large-scale IIT-R MMEmoRec dataset, consisting of real-life images, corresponding speech and text, and emotion labels ('angry,' 'happy,' 'hate,' and 'sad'). VISTA Net achieves 95.99% accuracy when image, speech, and text are all considered, surpassing its performance with any single modality or pair of modalities.
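To make the fusion step concrete, here is a minimal sketch in PyTorch, assuming a learnable softmax-normalized weight per modality and hypothetical encoder output sizes (512 for image, 256 for speech, 768 for text); it illustrates the weighted-average idea the abstract describes, not the authors' implementation. The `modality_contributions` helper is likewise a simple zero-ablation attribution in the spirit of KAAP, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

class WeightedAdditiveFusion(nn.Module):
    """Sketch of a late-fusion head: per-modality logits are combined
    as a weighted average whose weights are learned end-to-end,
    i.e., adjusted automatically rather than set by hand."""

    def __init__(self, num_classes: int = 4):  # 'angry', 'happy', 'hate', 'sad'
        super().__init__()
        # Hypothetical per-modality heads; in a full system these would
        # sit on top of image/speech/text feature extractors.
        self.image_head = nn.Linear(512, num_classes)
        self.speech_head = nn.Linear(256, num_classes)
        self.text_head = nn.Linear(768, num_classes)
        # One learnable scalar per modality; softmax keeps the fusion
        # a convex combination (a true weighted average).
        self.fusion_logits = nn.Parameter(torch.zeros(3))

    def forward(self, img_feat, speech_feat, text_feat):
        outs = torch.stack([
            self.image_head(img_feat),
            self.speech_head(speech_feat),
            self.text_head(text_feat),
        ])                                            # (3, batch, classes)
        w = torch.softmax(self.fusion_logits, dim=0)  # (3,) modality weights
        return (w[:, None, None] * outs).sum(dim=0)   # weighted average

@torch.no_grad()
def modality_contributions(model, img, sp, txt, target: int):
    """Zero-ablation attribution: how much does the target-class
    probability drop when one modality's features are blanked out?"""
    base = torch.softmax(model(img, sp, txt), dim=-1)[:, target]
    ablations = {
        "image": (torch.zeros_like(img), sp, txt),
        "speech": (img, torch.zeros_like(sp), txt),
        "text": (img, sp, torch.zeros_like(txt)),
    }
    return {
        name: (base - torch.softmax(model(*feats), dim=-1)[:, target]).mean().item()
        for name, feats in ablations.items()
    }
```

Training the whole network with an ordinary cross-entropy loss updates `fusion_logits` alongside the modality heads, which is what allows the modality weights to adjust without human intervention; calling `modality_contributions(model, img, sp, txt, target=2)` then returns a per-modality score indicating how strongly each input pushed the prediction toward the target class.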


Related research

08/25/2022
Interpretable Multimodal Emotion Recognition using Hybrid Fusion of Speech and Image Data
This paper proposes a multimodal emotion recognition system based on hyb...

06/05/2023
Interpretable Multimodal Emotion Recognition using Facial Features and Physiological Signals
This paper aims to demonstrate the importance and feasibility of fusing ...

02/27/2023
Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For Multimodal Emotion Recognition
The lack of data and the difficulty of multimodal fusion have always bee...

08/15/2020
Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition
Multimodal emotion recognition from speech is an important area in affec...

09/15/2021
Fusion with Hierarchical Graphs for Multimodal Emotion Recognition
Automatic emotion recognition (AER) based on enriched multimodal inputs,...

11/09/2019
M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues
We present M3ER, a learning-based method for emotion recognition from mu...

02/22/2018
Deep Multimodal Learning for Emotion Recognition in Spoken Language
In this paper, we present a novel deep multimodal framework to predict h...
