Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

by Juncheng Li, et al.

Multimodal Large Language Models (MLLMs) have recently sparked significant interest, demonstrating emergent capabilities to serve as general-purpose models for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which hinders the widespread availability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential context and cover a diverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG), trained on an image-captioning alignment objective, tends to attend to common foreground information for captioning but struggles to extract the specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM. Further, we introduce an annotation-free, cross-attention guided counterfactual image training strategy to methodically learn the proposed module through the collaboration of a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetor, a Transformer-based MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction-tuned models on the MME benchmark.
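The abstract does not specify how the re-injection module is built, so the following is only a toy illustration of the general idea of conditionally extracting instruction-specific visual information. It is not Cheetor's implementation. It assumes a hypothetical condition vector (standing in for whatever signal the LLM derives from the instruction) and uses plain scaled dot-product attention to pool a few image patch features; the function names, vectors, and dimensions are all made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditioned_visual_summary(condition, patch_features):
    """Pool patch features with attention weights driven by a condition vector.

    condition: hypothetical instruction-derived query vector (an assumption).
    patch_features: list of per-patch feature vectors of the same length.
    Returns (summary_vector, attention_weights).
    """
    scale = math.sqrt(len(condition))
    # Scaled dot-product score between the condition and each patch.
    scores = [
        sum(c * p for c, p in zip(condition, patch)) / scale
        for patch in patch_features
    ]
    weights = softmax(scores)
    dim = len(patch_features[0])
    # Weighted sum of patch features: patches matching the condition dominate.
    summary = [
        sum(w * patch[d] for w, patch in zip(weights, patch_features))
        for d in range(dim)
    ]
    return summary, weights


# Three toy "patches"; the condition favours the second feature dimension.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
condition = [0.0, 4.0]
summary, weights = conditioned_visual_summary(condition, patches)
```

With a different condition vector the same patches yield a different summary, which is the contrast the abstract draws with a VPG that always attends to the same common foreground content regardless of the instruction.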




