Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

by Shengyao Zhuang, et al.

The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures, where indexing and retrieval are two separate components, DSI uses a single transformer model to perform both. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing time, current DSI methods learn to build connections between long document texts and their identifiers, but at retrieval time, short query texts are provided to DSI models to retrieve the document identifiers. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem of current DSI models, we propose a simple yet effective indexing framework for DSI called DSI-QG. In DSI-QG, documents are represented by a number of relevant queries generated by a query generation model at indexing time. This allows DSI models to connect a document identifier to a set of query texts when indexing, hence mitigating the data distribution mismatch between the indexing and retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval benchmark datasets show that DSI-QG significantly outperforms the original DSI model.
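The indexing idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_query_generator` is a hypothetical stand-in for a trained query generation model (it just emits short word windows), and `build_indexing_pairs` is an assumed helper name. The sketch only shows the shape of the data: each document identifier is paired with several short generated queries rather than with the long document text.

```python
from typing import Callable, Dict, List, Tuple


def toy_query_generator(doc_text: str, num_queries: int) -> List[str]:
    """Hypothetical placeholder for a learned query generation model.

    Emits up to `num_queries` three-word windows taken from the start of the
    document, purely so the example runs without any model weights.
    """
    words = doc_text.split()
    num_windows = max(len(words) - 2, 1)
    return [" ".join(words[i:i + 3]) for i in range(min(num_queries, num_windows))]


def build_indexing_pairs(
    corpus: Dict[str, str],
    generate: Callable[[str, int], List[str]],
    num_queries: int,
) -> List[Tuple[str, str]]:
    """Build (input_text, target_docid) pairs for indexing.

    Each document is represented by generated queries, so the indexing inputs
    are short query-like texts, similar to what the model sees at retrieval.
    """
    pairs: List[Tuple[str, str]] = []
    for docid, text in corpus.items():
        for query in generate(text, num_queries):
            pairs.append((query, docid))
    return pairs


corpus = {
    "d1": "neural index maps text to identifiers",
    "d2": "queries are short",
}
pairs = build_indexing_pairs(corpus, toy_query_generator, num_queries=2)
for query, docid in pairs:
    print(f"{query!r} -> {docid}")
```

In a real setup, the resulting pairs would serve as training examples for the sequence-to-sequence DSI model, with the query text as input and the document identifier as the target.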


Augmenting Passage Representations with Query Generation for Enhanced Cross-Lingual Dense Retrieval

Effective cross-lingual dense retrieval methods that rely on multilingua...

Understanding Differential Search Index for Text Retrieval

The Differentiable Search Index (DSI) is a novel information retrieval (...

An Analysis of Indexing and Querying Strategies on a Technologically Assisted Review Task

This paper presents a preliminary experimentation study using the CLEF 2...

Doc2Query--: When Less is More

Doc2Query – the process of expanding the content of a document before in...

Generative Retrieval as Dense Retrieval

Generative retrieval is a promising new neural retrieval paradigm that a...

An Efficient Indexing and Searching Technique for Information Retrieval for Urdu Language

Indexing techniques are used to improve retrieval of data in response to...

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies

Recently, a new paradigm called Differentiable Search Index (DSI) has be...
