Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

06/08/2023
by Muhammad Maaz, et al.

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, used to train Video-ChatGPT and acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction sets, and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.


