d-blink: Distributed End-to-End Bayesian Entity Resolution

09/13/2019
by Neil G. Marchant, et al.

Entity resolution (ER), also known as record linkage or de-duplication, is the process of merging noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage", or d-blink: the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets which show that d-blink can achieve significant efficiency gains, in excess of 300×, when compared to existing non-distributed methods.
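
For readers unfamiliar with the Vose-Alias method named in contribution (iv), the following is a minimal Python sketch of the generic alias method only. It is not the paper's perturbation sampling algorithm, whose details are not given in the abstract. The function names `build_alias_table` and `sample` are illustrative assumptions. The idea is that a one-time O(n) setup allows each subsequent draw from a discrete distribution to be made in O(1) time.

```python
import random


def build_alias_table(weights):
    """Build Vose alias tables for a discrete distribution (O(n) setup).

    `weights` are non-negative and need not sum to one.
    Returns (prob, alias): column i keeps outcome i with probability
    prob[i] and otherwise yields outcome alias[i].
    """
    n = len(weights)
    total = sum(weights)
    # Scale so that the average column height is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # column s keeps its own mass...
        alias[s] = l          # ...and is topped up by outcome l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover columns are full (up to floating-point rounding).
    for i in small + large:
        prob[i] = 1.0
    return prob, alias


def sample(prob, alias, rng=random):
    """Draw one outcome in O(1): pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

A typical use would build the table once with `prob, alias = build_alias_table([1.0, 2.0, 3.0, 4.0])` and then call `sample(prob, alias)` repeatedly; the setup cost pays off when many draws are taken from the same distribution.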


research
10/17/2014

Variational Bayes for Merging Noisy Databases

Bayesian entity resolution merges together multiple, noisy databases and...
research
01/08/2023

Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

Entity resolution (record linkage or deduplication) is the process of id...
research
05/14/2019

Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers

Probabilistic record linkage (PRL) is the process of determining which r...
research
03/08/2017

Performance Bounds for Graphical Record Linkage

Record linkage involves merging records in large, noisy databases to rem...
research
10/11/2018

Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

Record linkage (de-duplication or entity resolution) is the process of m...
research
05/28/2020

Efficient and Effective ER with Progressive Blocking

Blocking is a mechanism to improve the efficiency of Entity Resolution (...
research
12/27/2017

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

Accurate and efficient entity resolution is an open challenge of particu...
