MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

03/20/2023
by Zhengyuan Yang, et al.

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
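
The mechanism described in the abstract, in which images reach the language model only as aligned file names plus expert-produced text (captions, textualized box coordinates), can be pictured as a reason-and-act loop. The sketch below is a minimal illustration under assumed stand-in functions; caption_image, detect_objects, and chat_completion are hypothetical stubs, not the authors' implementation or any real API.

```python
# Minimal sketch of an MM-REACT-style loop as described in the abstract.
# All function names here are hypothetical stand-ins for vision experts
# and the ChatGPT call; a real system would wire them to actual models.

def caption_image(image_path: str) -> str:
    # Stand-in for a captioning expert returning a text description.
    return "two people standing at a whiteboard"

def detect_objects(image_path: str) -> str:
    # Stand-in for a detector; boxes are textualized as coordinates so the
    # language model can reason over spatial information as plain text.
    return "person (12, 30, 180, 400); whiteboard (200, 15, 620, 380)"

def chat_completion(prompt: str) -> str:
    # Stand-in for the language model: request a vision expert if no
    # observation is present yet, otherwise produce a final answer.
    if "Observation:" not in prompt:
        return "Action: detect_objects"
    return "Final Answer: two people are discussing notes on a whiteboard"

def mm_react(question: str, image_path: str) -> str:
    # The image enters the prompt only as a file name plus textual
    # descriptions from vision experts.
    prompt = (
        f"Image file: {image_path}\n"
        f"Caption: {caption_image(image_path)}\n"
        f"Question: {question}\n"
    )
    reply = ""
    for _ in range(3):  # bounded reason-and-act loop
        reply = chat_completion(prompt)
        if reply.startswith("Final Answer:"):
            return reply
        # Invoke the requested vision expert and feed its text output back.
        prompt += f"Observation: {detect_objects(image_path)}\n"
    return reply

if __name__ == "__main__":
    print(mm_react("What are the people doing?", "meeting.png"))
```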

Related research

12/15/2022  MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Vision and language models (VL) are known to exploit unrobust indicators...

08/27/2023  MM-AU: Towards Multimodal Understanding of Advertisement Videos
Advertisement videos (ads) play an integral part in the domain of Intern...

05/24/2023  ECHo: Event Causality Inference via Human-centric Reasoning
We introduce ECHo, a diagnostic dataset of event causality inference gro...

08/31/2023  TouchStone: Evaluating Vision-Language Models by Language Models
Large vision-language models (LVLMs) have recently witnessed rapid advan...

05/23/2023  Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction
Despite constituting 65% ... underrepresented in generative AI research. Mean...

08/07/2023  Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Recent advancements in Large Vision-Language Models (LVLMs) have demonst...

07/26/2023  ECO: Ensembling Context Optimization for Vision-Language Models
Image recognition has recently witnessed a paradigm shift, where vision-...
