Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

05/25/2023
by Zheyuan Liu, et al.

Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and, at test time, compare them to a reference image embedding modified by the query text. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially when done independently of the potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distance metric and quickly prunes the candidate set. Our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the remaining candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
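
The two-stage scheme described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `two_stage_retrieve`), the cut-off `k`, and the toy `rerank_score` callable are all hypothetical. In particular, `rerank_score` merely stands in for a model that scores a reference-text-candidate triplet; here it is a lookup table so that the example runs without any trained network. Cosine similarity is assumed as the first-stage vector distance.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query_emb: Vector,
    corpus_embs: List[Vector],
    rerank_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return corpus indices ranked best-first.

    Stage 1 ranks the whole corpus with a cheap vector similarity against
    pre-computed embeddings. Stage 2 applies the (assumed expensive)
    per-candidate scorer only to the top-k shortlist and re-orders it.
    """
    # Stage 1: fast ranking over all pre-computed candidate embeddings.
    coarse = sorted(
        range(len(corpus_embs)),
        key=lambda i: cosine(query_emb, corpus_embs[i]),
        reverse=True,
    )
    shortlist, rest = coarse[:k], coarse[k:]

    # Stage 2: re-rank only the shortlist with the query-candidate scorer.
    reranked = sorted(shortlist, key=rerank_score, reverse=True)

    # Candidates pruned in stage 1 keep their coarse order after the shortlist.
    return reranked + rest


# Toy data: hypothetical 2-D embeddings and a hypothetical re-ranking table.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]  # stands in for the text-modified reference embedding
toy_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}

ranking = two_stage_retrieve(query, corpus, lambda i: toy_scores[i], k=2)
print(ranking)
```

In this toy setup the second stage swaps the order of the two shortlisted candidates, while a candidate pruned in the first stage is never re-scored even if the second-stage scorer would have preferred it. That is the trade-off the cut-off `k` controls: a larger `k` costs more second-stage evaluations but lowers the chance of pruning the correct target.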
