Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

08/12/2021
by Luyu Gao, et al.

Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) the need for large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into a dense vector through LM pre-training. On top of Condenser, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on the MS-MARCO, Natural Questions, and TriviaQA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large-batch training. Using simple small-batch fine-tuning, it performs comparably to RocketQA, a state-of-the-art, heavily engineered system.
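To make the corpus-level contrastive loss concrete, below is a minimal PyTorch sketch of the span-level objective the abstract describes: two spans are sampled from each passage, embedded, and trained so that spans from the same passage score higher than spans from other passages in the batch. The function name, the cosine-similarity/temperature (InfoNCE) formulation, and the toy inputs are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of a corpus-level span contrastive loss, under stated
# assumptions: names and the InfoNCE formulation are illustrative, not
# the paper's exact code.
import torch
import torch.nn.functional as F

def corpus_contrastive_loss(span_a: torch.Tensor,
                            span_b: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """In-batch contrastive loss over two spans sampled from each passage.

    span_a, span_b: [batch, dim] embeddings of two spans from the same
    passage; same-passage pairs are positives, all other in-batch spans
    serve as negatives.
    """
    a = F.normalize(span_a, dim=-1)
    b = F.normalize(span_b, dim=-1)
    logits = a @ b.t() / temperature                 # [batch, batch] similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: each span must retrieve its passage-mate from the batch.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.randn(8, 128)  # stand-ins for encoder outputs of first spans
    b = torch.randn(8, 128)  # stand-ins for encoder outputs of second spans
    print(corpus_contrastive_loss(a, b).item())
```

In the paper, this contrastive loss is trained jointly with Condenser's masked language modeling loss, and gradient caching lets the effective contrastive batch be large without large GPU memory; the sketch above isolates only the contrastive component.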


Related research

04/16/2021 · Is Your Language Model Ready for Dense Representation Fine-tuning?
Pre-trained language models (LM) have become go-to text representation e...

12/16/2021 · Towards Unsupervised Dense Information Retrieval with Contrastive Learning
Information retrieval is an important component in natural language proc...

04/17/2023 · Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval
Current dense retrievers (DRs) are limited in their ability to effective...

06/05/2023 · Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training
Dense retrievers have achieved impressive performance, but their demand ...

03/11/2022 · Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval
Recent rapid advancements in deep pre-trained language models and the in...

06/20/2018 · Injecting Relational Structural Representation in Neural Networks for Question Similarity
Effectively using full syntactic parsing information in Neural Networks ...

05/22/2023 · LM-Switch: Lightweight Language Model Conditioning in Word Embedding Space
In recent years, large language models (LMs) have achieved remarkable pr...
