FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks

by Santiago Castro, et al.
University of Michigan

Large-scale pretrained image-text models have shown strong zero-shot performance on a handful of tasks, including video tasks such as action recognition and text-to-video retrieval. However, these models have not been adapted to video, mainly because they do not account for the time dimension, but also because video frames differ from typical images (e.g., they contain motion blur and are less sharp). In this paper, we present a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks. We show that by carefully adapting these models we obtain considerable improvements on two zero-shot action recognition tasks and three zero-shot text-to-video retrieval tasks. The code is available at
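To make the zero-shot setting concrete, the sketch below shows one common way an image-text model can be applied to video without any temporal modeling: embed each frame, average the frame embeddings, and pick the label whose text embedding is most similar. It is an illustration only, not the fine-tuning strategy presented in this paper. The function names (`mean_pool`, `zero_shot_classify`) are hypothetical, and the small hand-written vectors stand in for embeddings that a real image-text encoder would produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding.

    Averaging discards frame order, which is exactly the missing
    time dimension mentioned in the abstract.
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]


def zero_shot_classify(frame_embeddings, label_embeddings):
    """Return the best label and all cosine scores for one video.

    frame_embeddings: list of per-frame vectors (assumed to come from an image encoder).
    label_embeddings: dict mapping label text to a vector (assumed to come from a text encoder).
    """
    normalized_frames = [l2_normalize(f) for f in frame_embeddings]
    video = l2_normalize(mean_pool(normalized_frames))
    scores = {
        label: sum(v * t for v, t in zip(video, l2_normalize(emb)))
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-in embeddings (assumption: real ones would be high-dimensional).
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
labels = {"playing guitar": [1.0, 0.0, 0.0], "swimming": [0.0, 1.0, 0.0]}

prediction, scores = zero_shot_classify(frames, labels)
print(prediction)
```

Because the pooled embedding is invariant to frame order, two videos with the same frames in reverse order receive identical scores, which is one way to see why such models do not capture temporal information.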


