Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations

by Sijie Mai, et al.

Learning effective joint embeddings for cross-modal data has long been a focus of multimodal machine learning. We argue that during multimodal fusion, the generated multimodal embedding may be redundant and discriminative unimodal information may be ignored, which often interferes with accurate prediction and increases the risk of overfitting. Moreover, unimodal representations also contain noisy information that negatively influences the learning of cross-modal dynamics. To this end, we introduce the multimodal information bottleneck (MIB), which aims to learn a powerful and sufficient multimodal representation that is free of redundancy and to filter out noisy information in unimodal representations. Specifically, inheriting from the general information bottleneck (IB), MIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target while constraining the mutual information between the representation and the input data. Unlike general IB, MIB regularizes both the multimodal and the unimodal representations, yielding a comprehensive and flexible framework that is compatible with any fusion method. We develop three MIB variants, namely early-fusion MIB, late-fusion MIB, and complete MIB, each focusing on a different perspective of the information constraints. Experimental results suggest that the proposed method reaches state-of-the-art performance on multimodal sentiment analysis and multimodal emotion recognition across three widely used datasets. The code is available at <>.
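To make the information bottleneck trade-off described in the abstract concrete, the following is a minimal sketch of a single-sample IB-style loss. It is not the authors' implementation. It assumes the common variational approximation in which the encoder outputs a diagonal Gaussian over the representation z, the compression term is replaced by the KL divergence to a standard normal prior (an upper bound on the mutual information between z and the input), and the prediction term is a squared error standing in for the negative log-likelihood of the target. The function names and the `beta` weight are hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one sample.

    Assumption: used here as a variational upper bound on I(z; x),
    i.e. the "constrain information about the input" term.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_z(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def ib_loss(mu, log_var, prediction, target, beta=1e-3):
    """Prediction term plus beta times the compression term.

    Assumption: squared error stands in for -log q(y | z), so minimizing
    it corresponds to maximizing a lower bound on I(z; y).
    """
    task = (prediction - target) ** 2
    compression = gaussian_kl_to_standard_normal(mu, log_var)
    return task + beta * compression


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.2], [0.0, -1.0]
    z = sample_z(mu, log_var, rng)  # would be fed to a predictor network
    print(ib_loss(mu, log_var, prediction=0.8, target=1.0, beta=0.01))
```

A larger `beta` pushes the representation toward the prior and discards more input information, while a smaller `beta` favors prediction accuracy. Regularizing both unimodal and multimodal representations, as the abstract describes, would presumably mean adding one such compression term per representation, but the exact form of the three variants is not given here.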


