Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

by Hang Zhang, et al.
Alibaba Group

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both the visual and auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and a frozen LLM. Unlike previous vision-LLMs such as MiniGPT-4 and LLaVA, which focus on static image comprehension, Video-LLaMA tackles two challenges in video understanding: (1) capturing the temporal changes in visual scenes and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former that assembles the pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder, and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the outputs of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on massive video/image-caption pairs as well as on visual-instruction-tuning datasets that are moderate in size but higher in quality. We find that Video-LLaMA can perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information presented in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants.
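To make the query-based design more concrete, here is a minimal NumPy sketch of the general idea: a small set of learnable queries attends over features from a frozen encoder, and the pooled result is projected into the LLM's embedding space. It is an illustration under stated assumptions, not the authors' implementation. The single cross-attention step, the sinusoidal temporal positions, all shapes, and the names `temporal_positions`, `query_pool` and `w_proj` are hypothetical; a real Q-former is a multi-layer transformer with trained weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_positions(num_frames, dim):
    # Assumption: sinusoidal encoding of the frame index, shape (T, D).
    pos = np.arange(num_frames)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))


def query_pool(frame_feats, queries, w_proj):
    """Pool frozen-encoder features with learnable queries, then project.

    frame_feats: (T, N, D) features for T frames with N tokens each
    queries:     (K, D) learnable query embeddings (random here)
    w_proj:      (D, D_llm) linear map into the LLM embedding space
    """
    num_frames, num_tokens, dim = frame_feats.shape
    # Inject frame order so the queries can tell frames apart.
    pos = temporal_positions(num_frames, dim)
    tokens = (frame_feats + pos[:, None, :]).reshape(num_frames * num_tokens, dim)
    # One cross-attention step: queries attend over all frame tokens.
    scores = queries @ tokens.T / np.sqrt(dim)   # (K, T*N)
    attn = softmax(scores, axis=-1)
    pooled = attn @ tokens                       # (K, D)
    # Project to the (hypothetical) LLM embedding width.
    return pooled @ w_proj                       # (K, D_llm)


rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(8, 16, 32))   # 8 frames, 16 tokens, width 32
queries = rng.normal(size=(4, 32))           # 4 query embeddings
w_proj = rng.normal(size=(32, 64))           # assumed LLM width of 64
video_tokens = query_pool(frame_feats, queries, w_proj)
print(video_tokens.shape)
```

The output has a fixed number of rows (one per query) regardless of how many frames go in, which is what lets a variable-length video be handed to the LLM as a short, fixed-length prefix. The same pattern could in principle be applied to audio-segment features in place of frame features.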

