DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

03/17/2023
by Peng Jin, et al.

Existing text-video retrieval solutions are, in essence, discriminative models that maximize the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it difficult to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between text and video as their joint probability p(candidates,query). We do so with a diffusion-based text-video retrieval framework (DiffusionRet), which casts retrieval as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) show superior performance and support the efficacy of our method. Moreover, without any modification, DiffusionRet also performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at https://github.com/jpthu17/DiffusionRet.
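To make the discriminative half of the training objective concrete, the following is a minimal sketch of a symmetric contrastive loss over a batch of text-video similarity scores. The abstract only states that the feature extractor is trained with a contrastive loss, so the exact form here (a symmetric InfoNCE-style loss), the `temperature` value, and the function names are illustrative assumptions, not the authors' implementation. The generation loss of the diffusion model is not specified in the abstract and is not sketched here.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def symmetric_contrastive_loss(sim, temperature=0.05):
    """Hypothetical helper: symmetric InfoNCE-style loss.

    sim[i][j] is the similarity between text i and video j; matched
    pairs are assumed to sit on the diagonal. The temperature is an
    assumed hyperparameter, not a value taken from the paper.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]

    # Text-to-video: each row is scored against all candidate videos.
    t2v = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n

    # Video-to-text: each column is scored against all candidate texts.
    columns = [[scaled[i][j] for i in range(n)] for j in range(n)]
    v2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (t2v + v2t)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
    misaligned = [[0.0, 1.0], [1.0, 0.0]]  # matched pairs score lowest
    print(symmetric_contrastive_loss(aligned))
    print(symmetric_contrastive_loss(misaligned))
```

A loss of this kind only shapes p(candidates|query) within a batch; the generative view described above additionally models the joint p(candidates,query).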


