UnLoc: A Unified Framework for Video Localization Tasks

08/21/2023
by   Shen Yan, et al.
0

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code will be available at: <https://github.com/google-research/scenic>.

READ FULL TEXT
research
08/06/2022

Frozen CLIP Models are Efficient Video Learners

Video recognition has been dominated by the end-to-end learning paradigm...
research
01/02/2022

TVNet: Temporal Voting Network for Action Localization

We propose a Temporal Voting Network (TVNet) for action localization in ...
research
01/18/2023

Temporal Perceiving Video-Language Pre-training

Video-Language Pre-training models have recently significantly improved ...
research
03/24/2021

Learning Salient Boundary Feature for Anchor-free Temporal Action Localization

Temporal action localization is an important yet challenging task in vid...
research
05/20/2022

Structured Attention Composition for Temporal Action Localization

Temporal action localization aims at localizing action instances from un...
research
05/23/2023

Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Sparse algorithms offer great flexibility for multi-view temporal percep...
research
01/11/2023

HADA: A Graph-based Amalgamation Framework in Image-text Retrieval

Many models have been proposed for vision and language tasks, especially...

Please sign up or login with your details

Forgot password? Click here to reset