Deep Clustering for Data Cleaning and Integration

05/22/2023
by Hafiz Tayyab Rauf, et al.

Deep Learning (DL) techniques now constitute the state of the art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the impact of DC on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC in canonical data cleaning and integration tasks, including schema inference, entity resolution and domain discovery, tasks which represent clustering from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we also observe that the choice of embedding approach for rows, columns, and tables significantly affects clustering performance.
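
To make "representations learned in tandem with clustering" concrete, the following is a minimal sketch of the clustering objective used by one well-known DC formulation, DEC-style soft assignment with a sharpened target distribution. The abstract does not name the DC algorithms that were compared, so this choice is an assumption made purely for illustration. The encoder network is omitted, and the embeddings, centroids and function names are hypothetical toy values rather than anything taken from the paper.

```python
import math

def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedding i to centroid j."""
    q = []
    for z in embeddings:
        weights = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def target_distribution(q):
    """Sharpened targets: p[i][j] proportional to q[i][j]^2 / sum_i q[i][j]."""
    k = len(q[0])
    cluster_freq = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_freq[j] for j in range(k)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q): the loss a DC model would minimise w.r.t. encoder and centroids."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0
    )

# Hypothetical 2-d embeddings of four items (e.g. rows, columns or tables),
# assumed to be produced by an encoder that is not shown here.
embeddings = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]  # hypothetical initial cluster centres

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
labels = [row.index(max(row)) for row in q]  # hard cluster label per item
print(labels, loss)
```

In a full DC model, the gradient of this loss would be propagated back into the encoder so that the embeddings and the cluster centres improve together. A non-DC baseline would instead cluster fixed embeddings, which is one reason the choice of embedding matters in both settings.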
