Self-Chained Image-Language Model for Video Localization and Question Answering

by Shoubin Yu, et al.

Recent studies have shown promising results on utilizing pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often miss important visual cues. Although humans often find a video moment to focus on and rewind it to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2. We chain these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. SeViLA outperforms several strong baselines and previous works on five video QA and event prediction tasks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We present a comprehensive analysis covering, e.g., the impact of the Localizer, comparisons of the Localizer with other temporal localization models, pre-training and self-refinement of the Localizer, and varying the number of keyframes.
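The forward and reverse chains described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: frames are plain strings, `toy_score` and `toy_answer` are hypothetical stand-ins for the BLIP-2-based Localizer and Answerer, and the pseudo-labeling rule (a frame counts as a keyframe if the Answerer gets the right answer from that frame alone) is an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Frame = str  # hypothetical stand-in for an image tensor


def forward_chain(
    frames: Sequence[Frame],
    question: str,
    score_frame: Callable[[Frame, str], float],
    answer: Callable[[Sequence[Frame], str], str],
    k: int = 4,
) -> Tuple[str, List[int]]:
    """Localizer scores each frame; Answerer sees only the top-k keyframes."""
    scores = [score_frame(f, question) for f in frames]
    # Pick the k highest-scoring frame indices, then restore temporal order.
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    keyframe_ids = sorted(top)
    keyframes = [frames[i] for i in keyframe_ids]
    return answer(keyframes, question), keyframe_ids


def reverse_chain_labels(
    frames: Sequence[Frame],
    question: str,
    gold_answer: str,
    answer: Callable[[Sequence[Frame], str], str],
) -> List[int]:
    """Assumed rule: label a frame 1 if the Answerer is correct from it alone."""
    return [int(answer([f], question) == gold_answer) for f in frames]


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_score(frame: Frame, question: str) -> float:
    return 1.0 if "dog" in frame else 0.0


def toy_answer(frames: Sequence[Frame], question: str) -> str:
    return "jumps" if any("jumps" in f for f in frames) else "unknown"


video = ["sky", "dog runs", "dog jumps", "grass"]
q = "What does the dog do after running?"

prediction, keyframe_ids = forward_chain(video, q, toy_score, toy_answer, k=2)
pseudo_labels = reverse_chain_labels(video, q, "jumps", toy_answer)
```

In this toy run, the forward chain keeps frames 1 and 2 and the Answerer predicts "jumps" from them. The reverse chain marks only frame 2 as a positive, and labels of this kind could then serve as training targets for the Localizer without manual moment annotations.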




