Multimodal Adaptation of CLIP for Few-Shot Action Recognition

by Jiazheng Xing, et al.

Applying large-scale pre-trained visual models such as CLIP to few-shot action recognition can improve both performance and efficiency: the "pre-training, fine-tuning" paradigm avoids training a network from scratch, which is time-consuming and resource-intensive. However, this approach has two drawbacks. First, the scarce labeled samples in few-shot action recognition require minimizing the number of tunable parameters to mitigate over-fitting, yet fine-tuning still increases resource consumption and may disrupt the model's generalized representation. Second, video adds an extra temporal dimension that makes effective temporal modeling difficult, whereas pre-trained visual models are usually image models. This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP for few-shot action recognition by adding lightweight adapters, which minimize the number of learnable parameters and allow the model to transfer quickly across tasks. The adapters we design combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, and are fast, efficient, and cheap to train. Additionally, we design an attention-based text-guided prototype construction module that fully exploits video-text information to enhance the representation of video prototypes. MA-CLIP is plug-and-play and can be used with any few-shot action recognition temporal alignment metric.
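The two ideas in the abstract, lightweight adapters on a frozen backbone and attention-based text-guided prototypes, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the bottleneck adapter shape, the zero-initialized up-projection, and the single-query attention over frame features are common simplifications chosen here for clarity, and all names (`adapter`, `text_guided_prototype`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adapter(x, w_down, w_up):
    """Bottleneck adapter: down-project, ReLU, up-project, plus a residual
    connection so the frozen backbone's features pass through unchanged at init."""
    h = np.maximum(x @ w_down, 0.0)   # (T, r) after down-projection + ReLU
    return x + h @ w_up               # (T, D) residual sum

def text_guided_prototype(frame_feats, text_feat):
    """Attend from a class's text feature to the video's frame features and
    return the attention-weighted mean as the class prototype (simplified)."""
    scores = frame_feats @ text_feat / np.sqrt(frame_feats.shape[-1])
    attn = softmax(scores)            # (T,) weights over frames
    return attn @ frame_feats         # (D,) prototype vector

rng = np.random.default_rng(0)
D, r, T = 8, 2, 4                     # feature dim, bottleneck rank, num frames
x = rng.standard_normal((T, D))       # frozen CLIP frame features (stand-in)
w_down = rng.standard_normal((D, r)) * 0.01
w_up = np.zeros((r, D))               # zero-init: adapter starts as identity
assert np.allclose(adapter(x, w_down, w_up), x)
proto = text_guided_prototype(x, rng.standard_normal(D))
print(proto.shape)                    # (8,)
```

Only `w_down` and `w_up` would be trained, which is how the parameter count stays small relative to full fine-tuning; the zero-initialized up-projection makes the adapted model start out identical to the frozen backbone.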


ActionCLIP: A New Paradigm for Video Action Recognition

The canonical approach to video action recognition dictates a neural mod...

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

Capitalizing on large pre-trained models for various downstream tasks of...

Knowledge Prompting for Few-shot Action Recognition

Few-shot action recognition in videos is challenging for its lack of sup...

Preserve Pre-trained Knowledge: Transfer Learning With Self-Distillation For Action Recognition

Video-based action recognition is one of the most popular topics in comp...

Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching

Class prototype construction and matching are core aspects of few-shot a...

Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

Humans can easily recognize actions with only a few examples given, whil...

Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration

Training an effective video action recognition model poses significant c...
