Learning Unseen Modality Interaction

06/22/2023
by   Yunhua Zhang, et al.

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption and instead strive for generalization to modality combinations unseen during training. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a feature projection module that maps the multidimensional features of different modalities into a common space while preserving rich information. This allows information to be accumulated with a simple summation operation across the available modalities. To reduce overfitting to unreliable modality combinations during training, we further improve model learning with pseudo-supervision indicating the reliability of each modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it on multimodal video classification, robot state regression, and multimedia retrieval.
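The core idea of the abstract, projecting each modality into a shared space so that any available subset can be fused by summation, can be sketched as follows. This is a minimal illustration, not the paper's actual module: the modality names, dimensions, and plain linear projections are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each modality has its own feature dimension,
# and all are projected into one common space of size COMMON_DIM.
COMMON_DIM = 8
modality_dims = {"video": 16, "audio": 12, "touch": 6}

# One linear projection per modality (an illustrative stand-in for
# the feature projection module described in the abstract).
projections = {
    m: rng.standard_normal((d, COMMON_DIM)) / np.sqrt(d)
    for m, d in modality_dims.items()
}

def fuse(features):
    """Project each available modality into the common space and sum.

    `features` maps modality name -> 1-D feature vector. Because the
    fusion is a plain sum over whatever modalities are present, a
    combination never seen during training still yields a vector of
    the same shape at inference time.
    """
    projected = [f @ projections[m] for m, f in features.items()]
    return np.sum(projected, axis=0)

# Combination seen during training: video + audio.
z_train = fuse({"video": rng.standard_normal(16),
                "audio": rng.standard_normal(12)})

# Unseen combination at inference: audio + touch.
z_test = fuse({"audio": rng.standard_normal(12),
               "touch": rng.standard_normal(6)})

assert z_train.shape == z_test.shape == (COMMON_DIM,)
```

The summation makes the fused representation order-invariant and defined for any subset of modalities, which is what lets a single model handle modality combinations absent from the training data.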

