Deep Indexed Active Learning for Matching Heterogeneous Entity Representations

by Arjit Jain, et al.

Given two large lists of records, the task in entity resolution (ER) is to find the pairs in the Cartesian product of the lists that correspond to the same real-world entity. Passive learning methods for tasks like ER typically require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, for instance-pair tasks the search space for informative samples to label grows quadratically, which makes active learning hard to scale. Previous work in this setting relies on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune unlikely pairs from the Cartesian product. This blocking step can miss important regions of the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, in which each committee member learns representations based on powerful transformer models. We highlight surprising differences between the matcher and the blocker in how their training data is created and in the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall, and running time. Code is available at
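To make the blocking idea concrete, the following is a minimal sketch of embedding-based candidate retrieval with a committee; it is not the DIAL implementation. It assumes each committee member supplies one embedding matrix per list, uses random vectors in place of learned transformer representations, and computes similarities by brute force where a real system would query a nearest-neighbor index. The function names `top_k_candidates` and `committee_block` are hypothetical.

```python
import numpy as np


def top_k_candidates(emb_r, emb_s, k):
    """For each row of emb_r, return the indices of its k most
    cosine-similar rows in emb_s."""
    r = emb_r / np.linalg.norm(emb_r, axis=1, keepdims=True)
    s = emb_s / np.linalg.norm(emb_s, axis=1, keepdims=True)
    # Brute-force similarity matrix, for illustration only; an
    # approximate nearest-neighbor index would avoid the full product.
    sims = r @ s.T
    return np.argsort(-sims, axis=1)[:, :k]


def committee_block(members_r, members_s, k):
    """Union the pairs retrieved by every committee member and count
    how many members retrieved each pair."""
    votes = {}
    for emb_r, emb_s in zip(members_r, members_s):
        neighbors = top_k_candidates(emb_r, emb_s, k)
        for i, row in enumerate(neighbors):
            for j in row:
                pair = (i, int(j))
                votes[pair] = votes.get(pair, 0) + 1
    return votes


# Assumption: random vectors stand in for learned embeddings.
rng = np.random.default_rng(0)
n_r, n_s, dim, n_members, k = 6, 8, 4, 3, 2
members_r = [rng.normal(size=(n_r, dim)) for _ in range(n_members)]
members_s = [rng.normal(size=(n_s, dim)) for _ in range(n_members)]

votes = committee_block(members_r, members_s, k)
# Pairs retrieved by only some members are the ones the committee
# disagrees on; they are natural candidates to send for labeling.
contested = [pair for pair, v in votes.items() if v < n_members]
print(len(votes), "candidate pairs out of", n_r * n_s)
```

Each member contributes at most `k` pairs per record, so the candidate set grows with `n_r * k * n_members` rather than with `n_r * n_s`. How members are trained and how labeled pairs are chosen are left out of this sketch.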




Low-resource Deep Entity Resolution with Transfer and Active Learning

Entity resolution (ER) is the task of identifying different representati...

A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Entity Matching (EM) is a core data cleaning task, aiming to identify di...

Pre-trained Language Model Based Active Learning for Sentence Matching

Active learning is able to significantly reduce the annotation cost for ...

Cross-Language Learning for Entity Matching

Transformer-based matching methods have significantly moved the state-of...

Block-SCL: Blocking Matters for Supervised Contrastive Learning in Product Matching

Product matching is a fundamental step for the global understanding of c...

Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making

Entity Matching (EM) aims at recognizing entity records that denote the ...
