Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!

10/13/2020
by Jack Hessel, et al.

Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
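The projection described above is straightforward to sketch. Given a model's scores f(t, v) evaluated over all text/image pairings drawn from an evaluation set, the multimodally-additive projection replaces each score with (text mean) + (image mean) - (grand mean), which is the closest function expressible as a sum of a text-only term and an image-only term. The snippet below is a minimal illustration of that idea, not the authors' released implementation; it assumes the full pairwise score matrix (one per output class) fits in memory.

```python
import numpy as np

def emap(scores: np.ndarray) -> np.ndarray:
    """Empirical multimodally-additive projection of a pairwise score matrix.

    scores[i, j] holds the model's score for text i paired with image j.
    The projection keeps only the additive, unimodal structure:
        f_hat(t_i, v_j) = mean_j' f(t_i, v_j') + mean_i' f(t_i', v_j)
                          - mean_{i', j'} f(t_i', v_j')
    """
    text_means = scores.mean(axis=1, keepdims=True)   # averages out the image
    image_means = scores.mean(axis=0, keepdims=True)  # averages out the text
    grand_mean = scores.mean()
    return text_means + image_means - grand_mean

# A purely additive score matrix (per-text term + per-image term)
# is left unchanged by the projection:
t = np.array([[1.0], [2.0], [3.0]])   # per-text contributions
v = np.array([[0.5, -0.5, 1.0]])      # per-image contributions
additive = t + v
assert np.allclose(emap(additive), additive)
```

Comparing task accuracy computed from `emap(scores)` against accuracy from the raw `scores` is the diagnostic: if the two are close, cross-modal interactions are contributing little.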

Related research:

- 09/30/2019 · Diachronic Cross-modal Embeddings. Understanding the semantic shifts of multimodal information is only poss...
- 09/02/2017 · XFlow: 1D-2D Cross-modal Deep Neural Networks for Audiovisual Classification. We propose two multimodal deep learning architectures that allow for cro...
- 06/12/2018 · Attentive cross-modal paratope prediction. Antibodies are a critical part of the immune system, having the function...
- 06/30/2022 · MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models. The promise of multimodal models for real-world applications has inspire...
- 04/19/2019 · EmbraceNet: A robust deep learning architecture for multimodal classification. Classification using multimodal data arises in many machine learning app...
- 05/15/2021 · Premise-based Multimodal Reasoning: A Human-like Cognitive Process. Reasoning is one of the major challenges of Human-like AI and has recent...
- 09/02/2021 · AnANet: Modeling Association and Alignment for Cross-modal Correlation Classification. The explosive increase of multimodal data makes a great demand in many c...
