Which is Making the Contribution: Modulating Unimodal and Cross-modal Dynamics for Multimodal Sentiment Analysis

by Ying Zeng, et al.

Multimodal sentiment analysis (MSA) is drawing increasing attention with the growing availability of multimodal data. Improvements in the performance of MSA models are mainly hindered by two problems. On the one hand, recent MSA works mostly focus on learning cross-modal dynamics but neglect to explore an optimal solution for the unimodal networks, which determine the lower limit of MSA models. On the other hand, noisy information hidden in each modality interferes with the learning of correct cross-modal dynamics. To address these problems, we propose a novel MSA framework, Modulation Model for Multimodal Sentiment Analysis (M^3SA), which identifies the contribution of each modality and reduces the impact of noisy information so as to better learn unimodal and cross-modal dynamics. Specifically, a modulation loss is designed to modulate the loss contribution based on the confidence of individual modalities in each utterance, so as to explore an optimal update solution for each unimodal network. In addition, unlike most existing works, which fail to explicitly filter out noisy information, we devise a modality filter module that identifies and filters out modality noise so that a correct cross-modal embedding can be learned. Extensive experiments on publicly available datasets demonstrate that our approach achieves state-of-the-art performance.
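
The abstract does not give the formula for the modulation loss, so the following is only an illustrative sketch of the general idea of weighting each modality's loss by a per-utterance confidence. Everything in it is an assumption made for illustration and not the authors' formulation: the function names, the use of a softmax over negative absolute errors as the confidence measure, the `temperature` parameter, and the squared-error unimodal loss.

```python
import math


def modality_confidences(predictions, label, temperature=1.0):
    """Hypothetical confidence measure (an assumption, not from the paper):
    a softmax over negative absolute errors, so that a modality whose
    prediction is closer to the label receives a higher confidence."""
    scores = [-abs(p - label) / temperature for p in predictions.values()]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {name: e / total for name, e in zip(predictions, exps)}


def modulated_loss(predictions, label, temperature=1.0):
    """Confidence-weighted sum of per-modality squared errors for one utterance.
    The weights are treated as constants here; in a real training loop one
    would decide whether gradients should flow through them."""
    conf = modality_confidences(predictions, label, temperature)
    return sum(conf[name] * (pred - label) ** 2
               for name, pred in predictions.items())


# Made-up unimodal sentiment predictions for a single utterance.
preds = {"text": 1.8, "audio": 0.5, "visual": -1.0}
conf = modality_confidences(preds, label=2.0)
loss = modulated_loss(preds, label=2.0)
```

In this sketch the text prediction is closest to the label, so it receives the largest weight and the visual prediction the smallest. Whether more confident modalities should contribute more or less to the update is a design choice that the abstract leaves open.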


