ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

by Junting Pan, et al.

Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) as the pre-trained model. This is limiting because in some modalities (e.g., video understanding), such a strong pre-trained model with sufficient knowledge is scarce or unavailable. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters compared to previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency.
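To give a feel for why an adapter updates so few parameters, here is a minimal sketch of the arithmetic. It assumes a bottleneck design (linear down-projection, depthwise 3D convolution, linear up-projection) and compares it with a standard Transformer block. The abstract does not spell out this layout, and the function names, the width of 768, the bottleneck of 384, and the 3x1x1 kernel are all illustrative assumptions. The ratio it prints is for one adapter against one block, and it is not a reproduction of the figures quoted above.

```python
# Illustrative parameter arithmetic only. All sizes are assumptions,
# not values taken from the abstract.

def adapter_params(d_model: int, d_bottleneck: int, kernel=(3, 1, 1)) -> int:
    """Count parameters of a hypothetical bottleneck adapter with a depthwise 3D conv."""
    kt, kh, kw = kernel
    down = d_model * d_bottleneck + d_bottleneck         # down-projection weights + bias
    dwconv = d_bottleneck * kt * kh * kw + d_bottleneck  # one kernel per channel + bias
    up = d_bottleneck * d_model + d_model                # up-projection weights + bias
    return down + dwconv + up


def transformer_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Count parameters of a standard Transformer encoder block."""
    attn = 4 * d_model * d_model + 4 * d_model           # Q, K, V, output projections + biases
    hidden = mlp_ratio * d_model
    mlp = d_model * hidden + hidden + hidden * d_model + d_model
    norms = 2 * 2 * d_model                              # two LayerNorms, scale and shift each
    return attn + mlp + norms


if __name__ == "__main__":
    d, r = 768, 384  # assumed model width and bottleneck width
    a = adapter_params(d, r)
    b = transformer_block_params(d)
    print(f"adapter: {a:,}  block: {b:,}  ratio: {a / b:.1%}")
```

Because the depthwise convolution has one small kernel per channel, almost all of the adapter's parameters sit in the two projections, so the bottleneck width is the main knob for trading capacity against per-task storage.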




Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding

Fine-tuning is widely used as the default algorithm for transfer learnin...

Preserve Pre-trained Knowledge: Transfer Learning With Self-Distillation For Action Recognition

Video-based action recognition is one of the most popular topics in comp...

Beyond Transfer Learning: Co-finetuning for Action Localisation

Transfer learning is the predominant paradigm for training deep networks...

Scalable Weight Reparametrization for Efficient Transfer Learning

This paper proposes a novel, efficient transfer learning method, called ...

Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

State-of-the-art video-text retrieval (VTR) methods usually fully fine-t...

Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Applying large-scale pre-trained visual models like CLIP to few-shot act...

Introspective Cross-Attention Probing for Lightweight Transfer of Pre-trained Models

We propose InCA, a lightweight method for transfer learning that cross-a...
