Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query

by Guanyu Cai, et al.

Text-based image retrieval has seen considerable progress in recent years. However, existing methods perform poorly in practice because users are likely to give an incomplete description of a complex scene, and the results are then filled with false positives that fit the incomplete description. In this work, we introduce the partial-query problem and extensively analyze its influence on text-based image retrieval. We then propose an interactive retrieval framework called Part2Whole that tackles this problem by iteratively enriching the missing details. Specifically, an Interactive Retrieval Agent is trained to learn an optimal policy for refining the initial query based on user-friendly interaction and the statistical characteristics of the gallery. Unlike other dialog-based methods, which rely heavily on the user to supply differentiating feedback, we let the AI take over the search for the optimal feedback and prompt the user with confirmation-based questions about details. Furthermore, since fully-supervised training is often infeasible due to the difficulty of obtaining human-machine dialog data, we present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset. Experiments show that our framework significantly improves the performance of text-based image retrieval in complex scenes.
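To make the interaction loop concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes each gallery image can be represented as a set of detail tags, and it replaces the learned Interactive Retrieval Agent policy with a simple greedy heuristic: ask about the detail that splits the remaining candidates most evenly. All names here (`TOY_GALLERY`, `candidates`, `pick_question`, `interactive_retrieve`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Set

# Assumption: each image is reduced to a set of detail tags (a toy stand-in
# for real image content and learned embeddings).
Gallery = Dict[str, Set[str]]

TOY_GALLERY: Gallery = {
    "a": {"dog", "park", "frisbee"},
    "b": {"dog", "park", "bench"},
    "c": {"dog", "beach", "frisbee"},
    "d": {"cat", "sofa"},
}


def candidates(gallery: Gallery, confirmed: Set[str], rejected: Set[str]) -> List[str]:
    """Images that contain every confirmed detail and no rejected detail."""
    return [
        img for img, tags in gallery.items()
        if confirmed <= tags and not (rejected & tags)
    ]


def pick_question(gallery: Gallery, cands: List[str], asked: Set[str]) -> Optional[str]:
    """Greedy stand-in for the learned policy: choose the unasked detail
    whose presence splits the current candidates most evenly."""
    counts: Dict[str, int] = {}
    for img in cands:
        for tag in gallery[img]:
            if tag not in asked:
                counts[tag] = counts.get(tag, 0) + 1
    # Keep only details that actually separate the candidates.
    useful = [t for t, c in counts.items() if 0 < c < len(cands)]
    if not useful:
        return None
    half = len(cands) / 2
    # Sorting first makes tie-breaking deterministic.
    return min(sorted(useful), key=lambda t: abs(counts[t] - half))


def interactive_retrieve(
    gallery: Gallery,
    partial_query: Set[str],
    answer: Callable[[str], bool],
    max_rounds: int = 5,
) -> List[str]:
    """Refine a partial query with yes/no confirmation questions."""
    confirmed: Set[str] = set(partial_query)
    rejected: Set[str] = set()
    asked: Set[str] = set(partial_query)
    cands = candidates(gallery, confirmed, rejected)
    for _ in range(max_rounds):
        if len(cands) <= 1:
            break
        tag = pick_question(gallery, cands, asked)
        if tag is None:
            break
        asked.add(tag)
        # The user only confirms or denies the proposed detail.
        (confirmed if answer(tag) else rejected).add(tag)
        cands = candidates(gallery, confirmed, rejected)
    return cands


if __name__ == "__main__":
    # Simulated user who has image "a" in mind but only typed "dog".
    target = TOY_GALLERY["a"]
    print(interactive_retrieve(TOY_GALLERY, {"dog"}, lambda tag: tag in target))
```

The simulated user only answers yes or no, which mirrors the confirmation-based questions described above. In the actual framework the question-selection step would be a policy trained with weakly-supervised reinforcement learning rather than this hand-written heuristic.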




Dialog-based Interactive Image Retrieval

Existing methods for interactive image retrieval have demonstrated the m...

Scene Graph based Image Retrieval – A case study on the CLEVR Dataset

With the prolification of multimodal interaction in various domains, rec...

Automatic Query Image Disambiguation for Content-Based Image Retrieval

Query images presented to content-based image retrieval systems often ha...

Learning to Retrieve Videos by Asking Questions

The majority of traditional text-to-video retrieval systems operate in s...

Extending Cross-Modal Retrieval with Interactive Learning to Improve Image Retrieval Performance in Forensics

Nowadays, one of the critical challenges in forensics is analyzing the e...

Image Retrieval with Mixed Initiative and Multimodal Feedback

How would you search for a unique, fashionable shoe that a friend wore a...

Simple Baselines for Interactive Video Retrieval with Questions and Answers

To date, the majority of video retrieval systems have been optimized for...
