MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

12/15/2022
by Letitia Parcalabescu, et al.

Vision and language (VL) models are known to exploit non-robust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on the relevant information in each modality. A small drop in accuracy when a unimodal model replaces the multimodal one on a VL task suggests that so-called unimodal collapse occurred. But how can we quantify the amount of unimodal collapse reliably, at dataset and instance level, so as to diagnose and combat it in a targeted way? We present MM-SHAP, a performance-agnostic multimodality score that quantifies the proportions in which a model uses the individual modalities in multimodal tasks. MM-SHAP is based on Shapley values and can be applied in two ways: (1) to compare models by their degree of multimodality, and (2) to measure the contribution of individual modalities for a given task and dataset. Experiments with six VL models – LXMERT, CLIP and four ALBEF variants – on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided. We recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at: https://github.com/Heidelberg-NLP/MM-SHAP
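For intuition, the sketch below shows one way a Shapley-based multimodality score of this kind could be estimated. It is a minimal illustration, not the authors' released implementation (see the linked repository for that): it assumes a generic, hypothetical score_fn that evaluates the model on a partially masked input, with text tokens ordered before image patches, and uses plain Monte Carlo permutation sampling to approximate the Shapley values.

```python
import numpy as np

def shapley_values(score_fn, n_players, n_samples=200, rng=None):
    """Monte Carlo estimate of Shapley values, treating each input part
    (text token or image patch) as a player whose presence is toggled
    by a binary mask. `score_fn(mask)` is assumed to return the model's
    score for the input with only the unmasked players visible."""
    rng = np.random.default_rng(rng)
    phi = np.zeros(n_players)
    for _ in range(n_samples):
        perm = rng.permutation(n_players)
        mask = np.zeros(n_players, dtype=bool)
        prev = score_fn(mask)           # score of the empty coalition
        for j in perm:
            mask[j] = True              # add player j to the coalition
            curr = score_fn(mask)
            phi[j] += curr - prev       # marginal contribution of j
            prev = curr
    return phi / n_samples

def mm_shap(phi, n_text_tokens):
    """Share of total absolute contribution per modality, assuming phi
    lists text tokens first, then image patches. Returns (T-SHAP, V-SHAP),
    which sum to 1 by construction."""
    contrib = np.abs(phi)
    t = contrib[:n_text_tokens].sum()
    v = contrib[n_text_tokens:].sum()
    t_shap = t / (t + v)
    return t_shap, 1.0 - t_shap
```

Under this reading, T-SHAP ≈ V-SHAP ≈ 0.5 means the model draws on both modalities roughly equally, while values near 0 or 1 signal collapse onto a single modality. The concrete masking scheme (e.g., substituting a mask token for text, zeroing image patches) is model-specific and deliberately left abstract here.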



