Multi-modal Prompting for Low-Shot Temporal Action Localization

03/21/2023
by Chen Ju, et al.

In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot / few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposals, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models for their lack of temporal motion information, we improve class-agnostic action proposals by explicitly aligning the embeddings of optical flow, RGB, and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., ones that avoid lexical ambiguities. Specifically, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large language models) or with visually-conditioned, instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
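To make the second stage concrete, below is a minimal sketch (not the authors' released code) of open-vocabulary classification of class-agnostic proposals using the OpenAI CLIP package: category names are expanded into detailed textual descriptions (hand-written placeholders here; the paper acquires them from a large language model), encoded by the frozen CLIP text encoder, and matched to proposal-level video features by cosine similarity. The proposal features `video_feat` and their projection into CLIP's joint embedding space are assumptions of this sketch, not part of CLIP itself.

```python
# Minimal sketch of the open-vocabulary classification stage
# (not the authors' released implementation; assumptions noted in comments).
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Detailed action descriptions standing in for plain class names; in the paper
# these come from a large language model, here they are hand-written placeholders.
class_descriptions = [
    "a person sprinting along an outdoor running track, arms pumping",
    "a person diving head-first from a springboard into a swimming pool",
]

with torch.no_grad():
    tokens = clip.tokenize(class_descriptions).to(device)
    text_emb = model.encode_text(tokens).float()               # (C, D) classifier weights
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # video_feat: (N, D) features of N class-agnostic proposals, assumed to be
    # already projected into CLIP's joint space by the video branch (placeholder).
    video_feat = torch.randn(5, text_emb.shape[-1], device=device)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)

    # Cosine-similarity logits; each proposal is labelled with its best-matching
    # (possibly unseen) category description.
    logits = 100.0 * video_feat @ text_emb.t()                  # (N, C)
    pred = logits.argmax(dim=-1)
```

The same matching step also accommodates the paper's alternative of visually-conditioned prompt vectors: the hand-written descriptions would simply be replaced by learned, instance-specific prompt tokens fed to the text encoder.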

Related research

CounTR: Transformer-based Generalised Visual Counting (08/29/2022)
In this paper, we consider the problem of generalised visual object coun...

Zero-Shot Temporal Action Detection via Vision-Language Prompting (07/17/2022)
Existing temporal action detection (TAD) methods rely on large training ...

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features (12/20/2022)
Detecting actions in untrimmed videos should not be limited to a small, ...

Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models (10/27/2022)
When trained at a sufficient scale, self-supervised learning has exhibit...

Prompting Visual-Language Models for Efficient Video Understanding (12/08/2021)
Visual-language pre-training has shown great success for learning joint ...

Few-Shot Transformation of Common Actions into Time and Space (04/06/2021)
This paper introduces the task of few-shot common action localization in...

OhMG: Zero-shot Open-vocabulary Human Motion Generation (10/28/2022)
Generating motion in line with text has attracted increasing attention n...