Valley: Video Assistant with Large Language model Enhanced abilitY

06/12/2023
by Ruipu Luo, et al.

Recently, several multi-modal models have been developed for joint image and language understanding, demonstrating impressive chat abilities by building on advanced large language models (LLMs). The recipe for developing such models is straightforward yet effective: pre-train an adaptation module to align the semantics of the vision encoder and the language model, then fine-tune on instruction-following data. However, despite the success of this pipeline for image and language understanding, its effectiveness for joint video and language understanding has not been widely explored. In this paper, we aim to develop a novel multi-modal foundation model capable of perceiving video, image, and language within a general framework. To achieve this goal, we introduce Valley: Video Assistant with Large Language model Enhanced abilitY. Specifically, the proposed Valley model uses a simple projection module to bridge the video, image, and language modalities, and is further unified with a multi-lingual LLM. We also collect multi-source vision-text pairs and adopt a spatio-temporal pooling strategy to obtain a unified vision encoding of video and image inputs for pre-training. Furthermore, we generate multi-task instruction-following video data, covering multi-shot captioning, long video description, action recognition, causal relationship inference, and more. To obtain this instruction-following data, we design diverse rounds of task-oriented, video-grounded conversations, facilitated by ChatGPT. Qualitative examples demonstrate that the proposed model has the potential to serve as a highly effective multilingual video assistant that makes complex video understanding scenarios easy. Code, data, and models will be available at https://github.com/RupertLuo/Valley.
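
For readers who want a concrete picture of the pipeline described above, the sketch below shows one plausible way to realize the spatio-temporal pooling plus projection-module idea in PyTorch. It is an assumption-laden illustration, not the authors' implementation: the class name SpatioTemporalPoolingProjector, the feature dimensions, and the choice to concatenate frame-averaged and patch-averaged tokens are hypothetical and may differ from Valley's actual design.

```python
# Minimal sketch (not the authors' released code) of the idea described in the
# abstract: per-frame features from a vision encoder are fused by spatio-temporal
# pooling and mapped into the LLM embedding space by a simple projection module.
# Class and argument names (SpatioTemporalPoolingProjector, vision_dim, llm_dim)
# are illustrative assumptions, not identifiers from the Valley repository.
import torch
import torch.nn as nn


class SpatioTemporalPoolingProjector(nn.Module):
    """Pools per-frame patch features over space and time, then projects them to the LLM width."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, num_patches, vision_dim).
        # An image is treated as a single-frame video (num_frames == 1), which is one
        # way to obtain the unified video/image encoding mentioned in the abstract.
        temporal_tokens = frame_feats.mean(dim=2)  # average over patches -> (B, T, D)
        spatial_tokens = frame_feats.mean(dim=1)   # average over frames  -> (B, P, D)
        vision_tokens = torch.cat([spatial_tokens, temporal_tokens], dim=1)  # (B, P + T, D)
        return self.proj(vision_tokens)            # (B, P + T, llm_dim)


if __name__ == "__main__":
    # 2 clips, 8 frames each, 256 patches per frame, CLIP-like 1024-d features.
    feats = torch.randn(2, 8, 256, 1024)
    tokens = SpatioTemporalPoolingProjector()(feats)
    print(tokens.shape)  # torch.Size([2, 264, 4096])
```

The resulting vision tokens can then be interleaved with text embeddings and fed to the multi-lingual LLM during pre-training and instruction tuning; the exact token layout and fusion scheme in Valley may differ.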


Related research

- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (04/27/2023)
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities (05/18/2023)
- Point-Bind Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following (09/01/2023)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models (06/08/2023)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding (06/29/2023)
- ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition (08/07/2023)
- VideoChat: Chat-Centric Video Understanding (05/10/2023)
