Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

03/01/2020
by Shizhe Chen, et al.

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The dominant approach to this problem is to learn a joint embedding space in which cross-modal similarities are measured. However, a single joint embedding is insufficient to represent complicated visual and textual details, such as scenes, objects, actions, and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. Specifically, the model disentangles text into a hierarchical semantic graph with three levels of events, actions, and entities, along with relationships across levels. Attention-based graph reasoning generates hierarchical textual embeddings, which in turn guide the learning of diverse and hierarchical video representations. The HGR model then aggregates matchings from the different video-text levels to capture both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. The hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences.
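To make the global-to-local matching concrete, below is a minimal PyTorch sketch of the idea: text graph nodes are refined with a single attention step (a simplified stand-in for the paper's attention-based graph reasoning), pooled into event, action, and entity embeddings, matched against level-specific video projections, and the per-level cosine similarities are aggregated. All class names, dimensions, and the one-layer attention are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMatcher(nn.Module):
    """Toy sketch of hierarchical video-text matching in the spirit of HGR.

    Names, dimensions, and the single attention step are illustrative
    assumptions, not the authors' implementation.
    """

    LEVELS = ("event", "action", "entity")

    def __init__(self, dim=512):
        super().__init__()
        # One projection per semantic level, so the same clip feature can be
        # specialized into level-specific embedding spaces.
        self.video_proj = nn.ModuleDict(
            {lvl: nn.Linear(dim, dim) for lvl in self.LEVELS}
        )
        # Stand-in for attention-based graph reasoning: text graph nodes
        # attend to one another once before being pooled per level.
        self.node_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def encode_text(self, nodes, level_ids):
        """nodes: (B, N, D) graph-node features; level_ids: (N,) in {0, 1, 2}."""
        reasoned, _ = self.node_attn(nodes, nodes, nodes)
        # Mean-pool the nodes belonging to each semantic level.
        return {
            lvl: reasoned[:, level_ids == i, :].mean(dim=1)
            for i, lvl in enumerate(self.LEVELS)
        }

    def forward(self, video, nodes, level_ids):
        """video: (B, D) clip features. Returns a (B, B) similarity matrix."""
        text = self.encode_text(nodes, level_ids)
        sim = 0.0
        for lvl in self.LEVELS:
            v = F.normalize(self.video_proj[lvl](video), dim=-1)
            t = F.normalize(text[lvl], dim=-1)
            sim = sim + v @ t.t()  # cosine similarity at this level
        return sim / len(self.LEVELS)  # aggregate global-to-local matchings

# Hypothetical usage: 4 video-caption pairs, 9 graph nodes per caption
# (1 event node, 2 action nodes, 6 entity nodes).
matcher = HierarchicalMatcher(dim=512)
video = torch.randn(4, 512)
nodes = torch.randn(4, 9, 512)
level_ids = torch.tensor([0, 1, 1, 2, 2, 2, 2, 2, 2])
scores = matcher(video, nodes, level_ids)  # (4, 4) retrieval scores
```

In training, such a score matrix would typically drive a contrastive ranking loss that pulls matched video-text pairs together and pushes mismatched pairs apart.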
