Universal Multi-Modality Retrieval with One Unified Embedding Space

09/01/2022
by Zhenghao Liu, et al.

This paper presents Vision-Language Universal Search (VL-UnivSearch), which builds a unified model for multi-modality retrieval. VL-UnivSearch encodes queries and multi-modality sources in a universal embedding space in order to search for relevant candidates and route across modalities. To learn an embedding space tailored for multi-modality retrieval, VL-UnivSearch proposes two techniques: 1) universal embedding optimization, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. VL-UnivSearch achieves state-of-the-art performance on the multi-modality open-domain question answering benchmark WebQA and outperforms all baseline retrieval models on each single-modality task. This demonstrates that a single unified model can feasibly replace the divide-and-conquer pipeline for universal multi-modality search while also benefiting per-modality tasks. All source code for this work will be released via GitHub.
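To make the first technique concrete, below is a minimal sketch (not the authors' released code) of a contrastive loss with modality-balanced hard negatives: each query is scored against its positive candidate plus an equal number of text and image hard negatives, so neither modality dominates the softmax. The function name, tensor shapes, and temperature value are illustrative assumptions.

```python
# Hypothetical sketch of modality-balanced hard-negative contrastive training.
# Assumes embeddings have already been produced by a shared encoder.
import torch
import torch.nn.functional as F

def modality_balanced_contrastive_loss(q_emb, pos_emb, text_neg_emb, img_neg_emb,
                                        temperature=0.01):
    """q_emb, pos_emb: [B, d]; text_neg_emb, img_neg_emb: [B, K, d].
    Each query sees 1 positive + K text negatives + K image negatives,
    balancing the two modalities in every contrastive comparison."""
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    negs = F.normalize(torch.cat([text_neg_emb, img_neg_emb], dim=1), dim=-1)  # [B, 2K, d]

    pos_score = (q * pos).sum(-1, keepdim=True)          # [B, 1]
    neg_score = torch.einsum("bd,bkd->bk", q, negs)      # [B, 2K]
    logits = torch.cat([pos_score, neg_score], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings.
B, K, d = 4, 3, 8
loss = modality_balanced_contrastive_loss(torch.randn(B, d), torch.randn(B, d),
                                           torch.randn(B, K, d), torch.randn(B, K, d))
print(loss.item())
```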
