Exploiting Temporal Relationships in Video Moment Localization with Natural Language

08/11/2019
by Jinsong Su, et al.

We address the problem of video moment localization with natural language, i.e., localizing a video segment described by a natural language sentence. Most prior work grounds the query as a whole and does not fully consider the temporal dependencies and reasoning between events within the text. In this paper, we propose a novel Temporal Compositional Modular Network (TCMN), in which a tree attention network first automatically decomposes a sentence into three descriptions corresponding to the main event, the context event, and the temporal signal. Two modules then measure the visual similarity and the location similarity between each segment and the decomposed descriptions. Moreover, since the main event and the context event may rely on different modalities (RGB or optical flow), we use late fusion to form an ensemble of four models, each trained independently on one combination of the visual inputs. Experiments show that our model outperforms state-of-the-art methods on the TEMPO dataset.
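
The late-fusion ensemble described above can be illustrated with a minimal sketch. The abstract does not state how the four models' outputs are combined, so this example assumes a plain unweighted average of per-segment scores, and it assumes that a higher score means a better match. The function names `late_fusion` and `best_segment` and all score values are hypothetical, chosen only for illustration.

```python
from itertools import product

# The two visual modalities named in the abstract.
MODALITIES = ("rgb", "flow")

# One model per (main-event modality, context-event modality) pair: 4 in total.
combos = list(product(MODALITIES, repeat=2))


def late_fusion(scores_by_model):
    """Average per-segment scores across independently trained models.

    Assumption: unweighted averaging; the source does not specify the rule.
    """
    score_lists = list(scores_by_model.values())
    num_segments = len(score_lists[0])
    if any(len(scores) != num_segments for scores in score_lists):
        raise ValueError("every model must score the same candidate segments")
    # zip(*...) groups the scores that all models gave to the same segment.
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


def best_segment(fused_scores):
    """Return the index of the highest fused score (assumes higher is better)."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Hypothetical scores for three candidate segments from each of the four models.
scores_by_model = {
    ("rgb", "rgb"): [0.2, 0.7, 0.1],
    ("rgb", "flow"): [0.3, 0.6, 0.1],
    ("flow", "rgb"): [0.1, 0.5, 0.4],
    ("flow", "flow"): [0.2, 0.4, 0.4],
}

fused = late_fusion(scores_by_model)
print(best_segment(fused))
```

Because fusion happens only on the output scores, each of the four models can be trained and evaluated separately, which matches the independent training described in the abstract.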


