Generation-Guided Multi-Level Unified Network for Video Grounding

03/14/2023
by Xing Cheng, et al.

Video grounding aims to locate the timestamps within an untrimmed video that best match a query description. Prevalent methods can be divided into moment-level and clip-level frameworks. Moment-level approaches directly predict, from a global perspective, the probability that each transient moment is a boundary, and they usually perform better at coarse grounding. Clip-level approaches, on the other hand, aggregate the moments in different time windows into proposals and then select the most similar one, which gives them an advantage in fine-grained grounding. In this paper, we propose a multi-level unified framework that leverages the merits of both moment-level and clip-level methods. Moreover, a novel generation-guided paradigm is adopted at both levels: a multi-modal generator produces implicit boundary and clip features, which are then treated as queries by a discriminator to compute boundary scores. This generation-guided solution reframes video grounding from a matching task between two distinct modalities into a cross-modal attention task, stepping outside the previous frameworks and yielding notable gains. The proposed Generation-guided Multi-level Unified network (GMU) surpasses previous methods and achieves state-of-the-art results on benchmarks with disparate features, e.g., Charades-STA and ActivityNet Captions.
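The abstract describes the generation-guided paradigm only at a high level. Below is a minimal, self-contained PyTorch sketch of what such a generator/discriminator pairing could look like: a transformer decoder layer stands in for the multi-modal generator, and a cross-attention module stands in for the discriminator whose attention weights over moments serve as boundary scores. All module names, layer choices, and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GenerationGuidedHead(nn.Module):
    """Hypothetical sketch of a generation-guided grounding head.

    A generator fuses video and sentence features into implicit
    boundary queries; a discriminator cross-attends those queries
    over the moment sequence to produce boundary scores.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Generator: conditions learnable boundary queries on both modalities.
        self.generator = nn.TransformerDecoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.query_embed = nn.Embedding(2, dim)  # start / end boundary queries
        # Discriminator: scores each moment against the generated queries.
        self.discriminator = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) moment-level features
        # text_feats:  (B, L, D) query-sentence token features
        B = video_feats.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        # Generator produces implicit boundary features from the fused context.
        fused = torch.cat([video_feats, text_feats], dim=1)       # (B, T+L, D)
        boundary_query = self.generator(queries, fused)           # (B, 2, D)
        # Discriminator: attention weights over moments act as boundary scores.
        _, attn = self.discriminator(boundary_query, video_feats, video_feats,
                                     need_weights=True)           # attn: (B, 2, T)
        start_scores, end_scores = attn[:, 0], attn[:, 1]         # each (B, T)
        return start_scores, end_scores
```

A forward pass such as GenerationGuidedHead()(torch.randn(2, 100, 256), torch.randn(2, 12, 256)) returns per-moment start and end score vectors of shape (2, 100), from which the highest-scoring moment pair would give the predicted interval.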

