LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

02/06/2023
by Ziyang Luo, et al.

Image-text retrieval (ITR) is the task of retrieving relevant images/texts given a query from the other modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders; however, it suffers from low retrieval speed in large-scale retrieval scenarios. In this work, we propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts to take advantage of bag-of-words models and efficient inverted indexes, resulting in significantly reduced retrieval latency. A crucial gap arises between the continuous nature of image data and the requirement for a sparse vocabulary-space representation. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. This framework features lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, allowing for constructing continuous bag-of-words bottlenecks to learn lexicon-importance distributions. Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art performance on two benchmark ITR datasets, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with a 5.5 to 221.3× faster retrieval speed and 13.2 to 48.8× less index storage memory.
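The lexicon-weighting paradigm described above can be illustrated with a minimal sketch: each image or text is represented as a sparse bag of (token, weight) pairs in vocabulary space, and an inverted index lets retrieval score only the documents that share at least one token with the query. The toy corpus, token weights, and function names below are illustrative assumptions, not the paper's actual implementation (in LexLIP the weights would come from the trained encoders).

```python
# Illustrative sketch of sparse lexicon-weighted retrieval with an
# inverted index. All names and weights here are hypothetical.
from collections import defaultdict


def build_inverted_index(corpus):
    """corpus: {doc_id: {token: weight}} -> {token: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, term_weights in corpus.items():
        for token, weight in term_weights.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query, top_k=3):
    """Score docs by the dot product of sparse query and document weights.

    Only postings for the query's tokens are touched, which is what makes
    inverted-index retrieval fast compared to scoring every dense vector.
    """
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]


# Toy "image" representations projected into vocabulary space.
corpus = {
    "img_dog":  {"dog": 2.1, "grass": 0.8, "run": 0.5},
    "img_cat":  {"cat": 1.9, "sofa": 0.7},
    "img_city": {"street": 1.4, "car": 1.1, "night": 0.6},
}
index = build_inverted_index(corpus)
print(search(index, {"dog": 1.0, "run": 0.4}))
```

Because the representations are sparse, only `img_dog` shares tokens with the query and gets scored; the other documents are never visited, which is the source of the latency and storage savings the abstract reports.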

Related research

08/31/2022 · LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval
In large-scale retrieval, the lexicon-weighting paradigm, learning weigh...

09/04/2023 · Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
The pre-training task is indispensable for the text-to-image person re-i...

12/20/2022 · What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary
Dual encoders are now the dominant architecture for dense retrieval. Yet...

07/17/2014 · Efficient On-the-fly Category Retrieval using ConvNets and GPUs
We investigate the gains in precision and speed, that can be obtained by...

05/24/2022 · HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
In the past few years, the emergence of vision-language pre-training (VL...

06/05/2023 · Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark
In this paper, we introduce a large Multi-Attribute and Language Search ...

05/18/2023 · Advancing Full-Text Search Lemmatization Techniques with Paradigm Retrieval from OpenCorpora
In this paper, we unveil a groundbreaking method to amplify full-text se...
