Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

03/15/2023
by Yulin Pan, et al.

Video temporal grounding aims to pinpoint the video segment that matches a query description. Despite recent advances on short-form videos (e.g., minutes long), temporal grounding in long videos (e.g., hours long) is still in its early stage. A common practice for this setting is to employ a sliding window, which can be inefficient and inflexible due to the limited number of frames within each window. In this work, we propose an end-to-end framework for fast temporal grounding that models an hours-long video with a single network execution. Our pipeline is formulated in a coarse-to-fine manner: we first extract context knowledge from non-overlapping video clips (i.e., anchors), and then supplement the anchors that respond strongly to the query with detailed content knowledge. Beyond its remarkably high efficiency, another advantage of our approach is its capability of capturing long-range temporal correlation, since it models the entire video as a whole, which in turn facilitates more accurate grounding. Experimental results on the long-form video datasets MAD and Ego4d show that our method significantly outperforms state-of-the-art approaches while achieving 14.6× / 102.8× higher efficiency, respectively. The code will be released at <https://github.com/afcedf/SOONet.git>
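The coarse-to-fine idea can be sketched in a few lines: score every anchor against the query in one pass over the whole video, then re-score individual frames only inside the few anchors that respond most strongly. The sketch below uses random features in place of real video and query encoders; all function names, shapes, and the dot-product scoring are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of coarse-to-fine grounding; random features stand in
# for real encoder outputs, and dot-product similarity is an assumption.
import numpy as np

def coarse_to_fine_ground(anchor_feats, frame_feats, query_feat, top_k=3):
    """Rank non-overlapping anchors by query similarity (coarse pass),
    then re-score frames only inside the top-k anchors (fine pass)."""
    # Coarse pass: one similarity score per anchor, over the whole video.
    anchor_scores = anchor_feats @ query_feat
    top_anchors = np.argsort(anchor_scores)[::-1][:top_k]

    # Fine pass: frame-level scores, computed only for the selected anchors.
    best_anchor, best_frame, best_score = -1, -1, -np.inf
    for a in top_anchors:
        frame_scores = frame_feats[a] @ query_feat  # frames in this anchor
        f = int(np.argmax(frame_scores))
        if frame_scores[f] > best_score:
            best_anchor, best_frame, best_score = int(a), f, float(frame_scores[f])
    return best_anchor, best_frame

# Toy data: 100 anchors of 16 frames each, 64-dim features.
rng = np.random.default_rng(0)
n_anchors, frames_per_anchor, dim = 100, 16, 64
anchor_feats = rng.standard_normal((n_anchors, dim))
frame_feats = rng.standard_normal((n_anchors, frames_per_anchor, dim))
query_feat = rng.standard_normal(dim)

anchor, frame = coarse_to_fine_ground(anchor_feats, frame_feats, query_feat)
print(anchor, frame)
```

Note the efficiency argument this illustrates: the fine (frame-level) pass touches only `top_k` anchors, so its cost is independent of total video length, while the coarse pass is a single cheap scan over all anchors.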


