Unified Coarse-to-Fine Alignment for Video-Text Retrieval

09/18/2023
by Ziyang Wang, et al.

The canonical approach to video-text retrieval leverages either a coarse-grained or a fine-grained alignment between visual and textual information. However, retrieving the correct video for a text query is often challenging, because it requires reasoning about both high-level (scene) and low-level (object) visual clues and how they relate to the query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation (ISA) module, which weighs the importance of different visual features while aggregating the cross-modal similarity into a single similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. By jointly considering the cross-modal similarity at different granularities, UCoFiA effectively unifies multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA.
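
The normalize-then-sum step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general Sinkhorn-Knopp idea of alternately rescaling rows and columns of a positive matrix. The function names `sinkhorn_knopp` and `unify_levels`, the `temperature` parameter, the exponentiation used to make entries positive, and the toy matrices are all assumptions made for illustration.

```python
import math


def sinkhorn_knopp(sim, n_iters=50, temperature=1.0):
    """Rescale a query-by-video similarity matrix so that rows and
    columns each sum to (approximately) 1. Hypothetical helper."""
    # Assumption of this sketch: exponentiate so every entry is positive.
    mat = [[math.exp(v / temperature) for v in row] for row in sim]
    n_rows, n_cols = len(mat), len(mat[0])
    for _ in range(n_iters):
        # Row step: make each text query's scores sum to 1.
        for i in range(n_rows):
            row_sum = sum(mat[i])
            mat[i] = [v / row_sum for v in mat[i]]
        # Column step: make each video's scores sum to 1.
        for j in range(n_cols):
            col_sum = sum(mat[i][j] for i in range(n_rows))
            for i in range(n_rows):
                mat[i][j] /= col_sum
    return mat


def unify_levels(levels, n_iters=50):
    """Normalize each granularity level separately, then sum them."""
    normalized = [sinkhorn_knopp(m, n_iters) for m in levels]
    n_rows, n_cols = len(levels[0]), len(levels[0][0])
    return [
        [sum(m[i][j] for m in normalized) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy similarity matrices (rows: text queries, columns: videos).
# The two levels deliberately live on different scales.
coarse = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.1], [0.1, 0.3, 0.7]]
fine = [[1.8, 0.2, 0.4], [0.4, 1.6, 0.2], [0.2, 0.6, 1.4]]

unified = unify_levels([coarse, fine])
# Retrieve the best-scoring video index for each query.
best_video = [row.index(max(row)) for row in unified]
```

Because each level is rescaled to the same row and column totals before the sum, a level whose raw scores happen to be larger (here `fine`) does not dominate the unified score simply because of its scale.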

research
07/15/2022

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Video-text retrieval has been a crucial and fundamental task in multi-mo...
research
11/21/2022

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

In this paper we tackle the cross-modal video retrieval problem and, mor...
research
05/20/2023

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

Text-video retrieval is a challenging cross-modal task, which aims to al...
research
11/01/2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity...
research
02/18/2021

Hierarchical Similarity Learning for Language-based Product Image Retrieval

This paper aims for the language-based product image retrieval task. The...
research
02/19/2023

Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

While recent progress in video-text retrieval has been advanced by the e...
research
07/25/2023

Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection

Human-Object Interaction (HOI) detection is a challenging computer visio...
