StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

by   Yanda Li, et al.
University of Technology Sydney
Nanyang Technological University
Apple Inc

The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLAVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities,


page 1

page 3

page 4

page 5

page 8

page 9


Instruction Tuning for Large Language Models: A Survey

This paper surveys research works in the quickly advancing field of inst...

SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions

Instruction finetuning is a popular paradigm to align large language mod...

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Multimodal Large Language Models (MLLMs) have recently sparked significa...

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Visual instruction tuning has recently shown encouraging progress with o...

Towards an On-device Agent for Text Rewriting

Large Language Models (LLMs) have demonstrated impressive capabilities f...

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Large language models (LLMs) have demonstrated remarkable language abili...

PIVOINE: Instruction Tuning for Open-world Information Extraction

We consider the problem of Open-world Information Extraction (Open-world...

Please sign up or login with your details

Forgot password? Click here to reset