Domain Adaptation for Memory-Efficient Dense Retrieval

05/23/2022
by Nandan Thakur, et al.

Dense retrievers encode documents into fixed-dimensional embeddings. However, storing all document embeddings in an index produces bulky indexes that are expensive to serve. Recently, BPR (Yamada et al., 2021) and JPQ (Zhan et al., 2021a) have been proposed, which train the model to produce binary document vectors and thereby reduce the index size by 32x or more. The authors showed that these binary embedding models significantly outperform more traditional index compression techniques like Product Quantization (PQ). Previous work evaluated these approaches only in-domain, i.e., on tasks for which training data is available. In practice, retrieval models are often used in an out-of-domain setting: they are trained on a publicly available dataset like MS MARCO but then applied to some custom corpus for which no training data is available. In this work, we show that binary embedding models like BPR and JPQ can perform significantly worse than baselines once a domain shift is involved. We propose a modification to the training procedure of BPR and JPQ and combine it with a corpus-specific generative procedure, which together allow adapting BPR and JPQ to any corpus without labeled training data. Our domain-adaptation strategy, known as GPL, is model agnostic and achieves improvements of up to 19.3 and 11.6 points nDCG@10 across the BEIR benchmark compared to BPR and JPQ, respectively, while maintaining their 32x memory efficiency. JPQ+GPL even outperforms our upper baseline, the uncompressed TAS-B model, by 2.0 points on average.
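The 32x figure follows from simple arithmetic: a float32 dimension costs 32 bits, while a binary code costs 1 bit per dimension, so a 768-dimensional float index shrinks from 3,072 bytes per document to 96 bytes. The sketch below illustrates this with plain sign-based binarization and brute-force Hamming search in NumPy; it is not the papers' method (BPR and JPQ learn their codes end-to-end), and the corpus size and 768-dimensional embeddings are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative sketch: binarizing float32 embeddings cuts index size 32x,
# since each 32-bit dimension becomes a single bit.
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((10_000, 768)).astype(np.float32)

# Sign-based binarization, packed 8 bits per byte: (10000, 768) -> (10000, 96)
binary_codes = np.packbits(doc_embeddings > 0, axis=1)

print(doc_embeddings.nbytes / binary_codes.nbytes)  # -> 32.0

# Retrieval over binary codes uses Hamming distance (XOR, then count set bits).
query_code = np.packbits(rng.standard_normal(768) > 0)
hamming = np.unpackbits(binary_codes ^ query_code, axis=1).sum(axis=1)
top10 = np.argsort(hamming)[:10]  # 10 nearest documents in Hamming space
```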

Related research

- GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval (12/14/2021)
- Domain-adaptation of spherical embeddings (11/01/2021)
- Neural Structural Correspondence Learning for Domain Adaptation (10/05/2016)
- Domain Adaptation for Sparse-Data Settings: What Do We Gain by Not Using Bert? (03/31/2022)
- Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder (05/06/2022)
- Non-Parametric Domain Adaptation for End-to-End Speech Translation (05/23/2022)
- Tiny CNN for feature point description for document analysis: approach and dataset (09/09/2021)
