VLG-Net: Video-Language Graph Matching Network for Video Grounding

11/19/2020
by Sisi Qu, et al.

Grounding language queries in videos aims to identify the time interval (or moment) semantically relevant to a language query. Solving this challenging task demands understanding the semantic content of both video and query, as well as fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge as an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across domains, we design a novel Video-Language Graph Matching Network (VLG-Net) that matches video and query graphs. Its core ingredients are representation graphs, built separately on top of video snippets and query tokens, which model intra-modality relationships. A Graph Matching layer then performs cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created by fusing each moment's enriched snippet features through masked moment attention pooling. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with natural language queries: ActivityNet-Captions, TACoS, and DiDeMo.
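To make the data flow concrete, below is a minimal PyTorch sketch of the three stages the abstract names: an intra-modality graph convolution, a cross-modal matching/fusion step, and masked moment attention pooling. All class names, the plain dot-product alignment, and the single-layer pooling scorer are illustrative assumptions for exposition; the paper's actual graph construction and matching layers are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """Intra-modality graph convolution (sketch): aggregate neighbor
    features through a row-normalized adjacency, then project."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, nodes, dim), adj: (batch, nodes, nodes)
        return F.relu(self.proj(torch.bmm(adj, x)))

class CrossModalMatching(nn.Module):
    """Cross-modal fusion (sketch): align each video-snippet node to
    query-token nodes with dot-product attention, then fuse the
    attended query context back into the video stream."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video, query):
        # video: (batch, n_snippets, dim), query: (batch, n_tokens, dim)
        attn = torch.bmm(video, query.transpose(1, 2)).softmax(dim=-1)
        context = torch.bmm(attn, query)           # per-snippet query context
        return F.relu(self.fuse(torch.cat([video, context], dim=-1)))

class MaskedMomentPooling(nn.Module):
    """Masked moment attention pooling (sketch): each candidate moment
    attends only to the snippets it spans and pools them into one vector.
    Assumes every moment covers at least one snippet."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, snippets, moment_mask):
        # snippets: (batch, n_snippets, dim)
        # moment_mask: (batch, n_moments, n_snippets), bool, True inside the moment
        logits = self.score(snippets).squeeze(-1)          # (batch, n_snippets)
        logits = logits.unsqueeze(1).expand_as(moment_mask)
        attn = logits.masked_fill(~moment_mask, float('-inf')).softmax(dim=-1)
        return torch.bmm(attn, snippets)                   # (batch, n_moments, dim)

# Illustrative shapes only: 2 videos, 64 snippets, 12 tokens, 16 moments, dim 256.
B, S, T, M, D = 2, 64, 12, 16, 256
video, query = torch.randn(B, S, D), torch.randn(B, T, D)
adj = torch.softmax(torch.randn(B, S, S), dim=-1)          # stand-in adjacency
video = GCNLayer(D)(video, adj)                            # intra-modality pass
video = CrossModalMatching(D)(video, query)                # cross-modal fusion
mask = torch.zeros(B, M, S, dtype=torch.bool)
mask[..., :8] = True                                       # toy moment spans
moments = MaskedMomentPooling(D)(video, mask)              # (2, 16, 256)
```

The sketch uses a single shared attention scorer and dot-product alignment for brevity; any learned affinity and moment-proposal scheme could be substituted without changing the overall pipeline.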


Related research

06/06/2019
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment i...

10/12/2021
Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos
This paper focuses on tackling the problem of temporal language localiza...

03/14/2023
Generation-Guided Multi-Level Unified Network for Video Grounding
Video grounding aims to locate the timestamps best matching the query de...

09/10/2021
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding
Temporal grounding aims to localize a video moment which is semantically...

09/22/2022
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Video temporal grounding (VTG) targets to localize temporal moments in a...

08/06/2020
Fine-grained Iterative Attention Network for Temporal Language Localization in Videos
Temporal language localization in videos aims to ground one video segmen...

08/14/2023
Knowing Where to Focus: Event-aware Transformer for Video Grounding
Recent DETR-based video grounding models have made the model directly pr...
