Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

by Zhenhailong Wang, et al.

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks, such as domain-specific captioning, question answering, and future event prediction, from only a few examples. Existing few-shot video-language learners focus exclusively on the encoder, so they lack a video-to-text decoder for generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and cannot generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without pretraining or finetuning on any video datasets. We use image-language models to translate the video content into frame captions and object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to incorporate any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. In particular, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes.
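The prompt composition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field labels ("Objects", "Frame captions", "Subtitles"), the temporal markers ("First", "Then", "Finally"), and the names `VideoDescription`, `temporal_caption`, `render`, and `build_prompt` are all assumptions made for illustration, and the image-language models and the language model itself are left out. The sketch only shows how per-video text could be ordered in time and combined with a few in-context examples into a single prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoDescription:
    """Text extracted from one video (hypothetical container, not from the paper)."""
    frame_captions: List[str]                 # one caption per sampled frame, in time order
    objects: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    asr: str = ""                             # optional speech transcript
    target: str = ""                          # known output for in-context examples, empty for the query


def temporal_caption(frame_captions: List[str]) -> str:
    """Join frame captions in order, prefixing each with an assumed temporal marker."""
    last = len(frame_captions) - 1
    parts = []
    for i, caption in enumerate(frame_captions):
        if i == 0:
            marker = "First"
        elif i == last:
            marker = "Finally"
        else:
            marker = "Then"
        parts.append(f"{marker}, {caption}.")
    return " ".join(parts)


def render(video: VideoDescription, label: str) -> str:
    """Render one video as a text block; the labels are illustrative assumptions."""
    lines = []
    if video.objects:
        lines.append("Objects: " + ", ".join(video.objects))
    if video.attributes:
        lines.append("Attributes: " + ", ".join(video.attributes))
    if video.events:
        lines.append("Events: " + ", ".join(video.events))
    lines.append("Frame captions: " + temporal_caption(video.frame_captions))
    if video.asr:
        lines.append("Subtitles: " + video.asr)
    # For the query video the target is empty, so the line ends at the colon
    # and the language model is expected to continue from there.
    lines.append(f"{label}: {video.target}".rstrip())
    return "\n".join(lines)


def build_prompt(instruction: str, examples: List[VideoDescription],
                 query: VideoDescription, label: str = "Video caption") -> str:
    """Concatenate the instruction, the in-context examples, and the query block."""
    blocks = [instruction]
    blocks += [render(example, label) for example in examples]
    blocks.append(render(query, label))
    return "\n\n".join(blocks)
```

The resulting string would be passed to a language model of choice; switching the task (for example, from captioning to question answering) would only require changing the instruction, the label, and the in-context targets, which is the flexibility the abstract attributes to prompting.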




