Weakly Supervised Video Moment Retrieval From Text Queries

Several recent methods address text-to-video moment retrieval using natural language queries, but they require full supervision during training. Acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable. To cope with this issue, we introduce the problem of learning from weak labels for the task of text-to-video moment retrieval. The supervision is weak because, during training, we only have access to video-text pairs rather than the temporal extent of the video to which each text description relates. We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions. Specifically, our main idea is to utilize the latent alignment between video frames and sentence descriptions using Text-Guided Attention (TGA). TGA is then used during the test phase to retrieve relevant moments. Experiments on two benchmark datasets demonstrate that our method achieves comparable performance to state-of-the-art fully supervised approaches.
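The abstract does not spell out how TGA is computed, so the following is only a minimal sketch under stated assumptions, not the authors' formulation. It assumes frame features and the sentence embedding already live in a shared embedding space, scores each frame by cosine similarity to the sentence, and normalizes the scores with a softmax. The attention-weighted pooled feature stands in for a video-level representation that could be trained from video-text pairs alone, and at test time the same weights are reused to pick the fixed-length window with the highest mean attention. The function names, the toy 2-D features, and the sliding-window retrieval rule are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def text_guided_attention(frames, sentence):
    """Softmax over frame-to-sentence similarities (assumed scoring rule)."""
    scores = [cosine(f, sentence) for f in frames]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_video_feature(frames, weights):
    """Attention-weighted sum of frame features: one vector per video."""
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

def retrieve_moment(weights, window):
    """Return the half-open frame range [start, end) with the highest mean attention."""
    best_start, best_score = 0, -1.0
    for start in range(len(weights) - window + 1):
        score = sum(weights[start:start + window]) / window
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_start + window

# Toy example: frames 2 and 3 point in the same direction as the sentence.
frames = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [1.0, 0.1]]
sentence = [0.0, 1.0]

weights = text_guided_attention(frames, sentence)
video_feature = pooled_video_feature(frames, weights)
moment = retrieve_moment(weights, window=2)
print(moment)  # (2, 4)
```

In this sketch no temporal boundaries are ever supplied: the only signal that could shape the attention weights during training is whether the pooled video feature matches its paired sentence, which is the sense in which the supervision is weak.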


