Are Multimodal Models Robust to Image and Text Perturbations?

12/15/2022
by   Jielin Qiu, et al.
11

Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this paper, we investigate the robustness of 9 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI and MOR) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models.

READ FULL TEXT

page 20

page 21

page 23

page 25

page 26

page 27

page 28

page 29

research
03/15/2023

PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning

Vulnerability to lexical perturbation is a critical weakness of automati...
research
07/05/2022

Multi-modal Robustness Analysis Against Language and Visual Perturbations

Joint visual and language modeling on large-scale datasets has recently ...
research
04/20/2023

Enhancing object detection robustness: A synthetic and natural perturbation approach

Robustness against real-world distribution shifts is crucial for the suc...
research
08/09/2023

TextPainter: Multimodal Text Image Generation withVisual-harmony and Text-comprehension for Poster Design

Text design is one of the most critical procedures in poster design, as ...
research
05/25/2022

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Text-to-image generation and image captioning are recently emerged as a ...
research
05/09/2023

WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Webpages have been a rich resource for language and vision-language task...
research
04/28/2022

GRIT: General Robust Image Task Benchmark

Computer vision models excel at making predictions when the test distrib...

Please sign up or login with your details

Forgot password? Click here to reset