Record fusion: A learning approach

06/18/2020
by Alireza Heidari, et al.

Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute of each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell (that is, each (row, column) pair) of that database. We use this feature vector along with the ground-truth information to learn a classifier for each of the attributes of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed from the predictions of the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can be viewed as a deep model where, at each stage, we add more complicated non-linear transformations of the original feature vector. We show that our approach fuses records with an average precision of ~98% [...] information across a diverse array of real-world datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods considered in the literature. We show that our approach can achieve an average precision improvement of ~20% [...] respectively.
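The stagewise idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy features (the share of records in a cluster that agree with a cell's value, and an "empty value" flag), the function names, the `keep` parameter, and the plain gradient-descent training loop are all assumptions made for illustration. Each stage trains a softmax classifier on a chosen part of the original feature vector concatenated with the class probabilities predicted by the previous stage.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, x):
    """Class probabilities for one feature vector; last weight column is the bias."""
    xb = x + [1.0]
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])

def train_softmax(X, y, n_classes, epochs=200, lr=0.5):
    """Fit a softmax classifier with batch gradient descent on cross-entropy."""
    n_cols = len(X[0]) + 1
    W = [[0.0] * n_cols for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_cols for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(W, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(x + [1.0]):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_cols):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def fit_stagewise(X, y, n_classes, n_stages=3, keep=None):
    """Stage t sees [kept original features] + [stage t-1 class probabilities]."""
    keep = list(range(len(X[0]))) if keep is None else keep
    stages = []
    prev = [[] for _ in X]  # no previous predictions before the first stage
    for _ in range(n_stages):
        Z = [[x[j] for j in keep] + p for x, p in zip(X, prev)]
        W = train_softmax(Z, y, n_classes)
        stages.append(W)
        prev = [predict_proba(W, z) for z in Z]
    return stages, keep

def predict_stagewise(stages, keep, x):
    """Run one feature vector through every stage; return the final class index."""
    prev = []
    for W in stages:
        prev = predict_proba(W, [x[j] for j in keep] + prev)
    return max(range(len(prev)), key=prev.__getitem__)

# Toy cells for one attribute (hypothetical features):
# [share of cluster records agreeing with this value, value-is-empty flag]
X = [[0.90, 0.0], [0.80, 0.0], [0.85, 0.0], [0.95, 0.0],
     [0.10, 0.0], [0.20, 0.0], [0.15, 1.0], [0.05, 1.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = cell value matches the ground truth

stages, keep = fit_stagewise(X, y, n_classes=2)
print(predict_stagewise(stages, keep, [0.90, 0.0]))
print(predict_stagewise(stages, keep, [0.10, 0.0]))
```

In this sketch every stage keeps all original features; passing a shorter `keep` list would model the "part of the original feature vector" mentioned in the abstract. A real pipeline would train one such model per attribute and would need held-out data to decide how many stages to add.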

research
09/29/2017

Entity Consolidation: The Golden Record Problem

Four key processes in data integration are: data preparation (i.e., extr...
research
08/24/2020

On sampling from data with duplicate records

Data deduplication is the task of detecting records in a database that c...
research
02/07/2016

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Entity resolution (ER), an important and common data cleaning problem, i...
research
08/28/2019

On Inferring Training Data Attributes in Machine Learning Models

A number of recent works have demonstrated that API access to machine le...
research
08/23/2020

A Prior for Record Linkage Based on Allelic Partitions

In database management, record linkage aims to identify multiple records...
research
06/08/2021

Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making

Entity Matching (EM) aims at recognizing entity records that denote the ...
research
03/11/2011

SPPAM - Statistical PreProcessing AlgorithM

Most machine learning tools work with a single table where each row is a...
