MIMIC-IT: Multi-Modal In-Context Instruction Tuning

by   Bo Li, et al.

High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs should be imperative to tune vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs in terms of quantity, diversity, and creativity remains limited, posing challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed as Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Based on extensive evaluations conducted on vision-language benchmarks, it has been observed that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation reveals it effectively aligns with the user's intentions. We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.


page 2

page 4

page 7

page 8

page 18

page 19

page 20

page 22


Otter: A Multi-Modal Model with In-Context Instruction Tuning

Large language models (LLMs) have demonstrated significant universal cap...

TUTORING: Instruction-Grounded Conversational Agent for Language Learners

In this paper, we propose Tutoring bot, a generative chatbot trained on ...

Visual Instruction Tuning with Polite Flamingo

Recent research has demonstrated that the multi-task fine-tuning of mult...

CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning

Nowadays, the research on Large Vision-Language Models (LVLMs) has been ...

LMEye: An Interactive Perception Network for Large Language Models

Training a Large Visual Language Model (LVLM) from scratch, like GPT-4, ...

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Recent research on Large Language Models (LLMs) has led to remarkable ad...

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Instruction tuning unlocks the superior capability of Large Language Mod...

Please sign up or login with your details

Forgot password? Click here to reset