Zero-Shot Temporal Action Detection via Vision-Language Prompting

07/17/2022
by   Sauradip Nag, et al.

Existing temporal action detection (TAD) methods rely on large training datasets with segment-level annotations and, at inference, can recognize only the classes seen during training. Collecting and annotating a large training set for every class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) removes this obstacle by enabling a pre-trained model to recognize unseen action classes. It is, however, far more challenging and far less studied. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive approach is to couple an off-the-shelf proposal detector with CLIP-style classification. However, because localization (e.g., proposal generation) and classification are performed sequentially, this design is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). This design eliminates the dependence between localization and classification by cutting off the route for error propagation between them. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that STALE significantly outperforms state-of-the-art alternatives. Moreover, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.
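The CLIP-style classification mentioned above amounts to scoring a visual embedding against text-prompt embeddings, one per (possibly unseen) class, by cosine similarity. The following is a minimal illustrative sketch of that idea, not the STALE model itself; the random vectors stand in for the outputs of CLIP's visual and text encoders, and the class names and prompt template are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length before computing cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(segment_embedding, class_embeddings, temperature=0.07):
    """Score a video-segment embedding against text-prompt embeddings.

    segment_embedding: (d,) visual feature for one candidate segment.
    class_embeddings:  (num_classes, d) text features for prompts such as
                       "a video of a person doing {action}", one per class.
    Returns a softmax distribution over the classes.
    """
    v = l2_normalize(segment_embedding)
    t = l2_normalize(class_embeddings)
    logits = t @ v / temperature      # scaled cosine similarities
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy demo with random stand-in embeddings (not real CLIP features).
rng = np.random.default_rng(0)
classes = ["high jump", "long jump", "diving"]
text_feats = rng.normal(size=(3, 512))
# Pretend the segment's visual feature lies close to the "long jump" prompt.
visual_feat = text_feats[1] + 0.1 * rng.normal(size=512)
probs = zero_shot_classify(visual_feat, text_feats)
print(classes[int(np.argmax(probs))])
```

Because the class set enters only through the text prompts, swapping in new class names at test time requires no retraining, which is what makes the zero-shot setting possible; the paper's point is that running this classifier after a separate proposal stage lets localization errors corrupt the scores.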


Related research

Semi-Supervised Temporal Action Detection with Proposal-Free Masking (07/14/2022)
Existing temporal action detection (TAD) methods rely on a large number ...

Zero-Shot Multi-View Indoor Localization via Graph Location Networks (08/06/2020)
Indoor localization is a fundamental problem in location-based applicati...

Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions (07/28/2017)
We aim for zero-shot localization and classification of human actions in...

Multi-modal Prompting for Low-Shot Temporal Action Localization (03/21/2023)
In this paper, we consider the problem of temporal action localization u...

Multi-Modal Few-Shot Temporal Action Detection (11/27/2022)
Few-shot (FS) and zero-shot (ZS) learning are two different approaches f...

Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition (01/27/2021)
We introduce SynSE, a novel syntactically guided generative approach for...

Decoupling Localization and Classification in Single Shot Temporal Action Detection (04/16/2019)
Video temporal action detection aims to temporally localize and recogniz...
