Efficient Computation of Multiple Density-Based Clustering Hierarchies

HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter mpts. While the performance of HDBSCAN* is robust w.r.t. mpts, in the sense that a small change in mpts typically leads to little or no change in the clustering structure, choosing a "good" mpts value can be challenging: depending on the data distribution, a high or a low value of mpts may be more appropriate, and certain clusters may reveal themselves only at particular values of mpts. To explore results for a range of mpts values, however, one has to run HDBSCAN* independently for each value in the range, which is computationally inefficient. In this paper, we propose an efficient approach to compute all HDBSCAN* hierarchies for a range of mpts values by replacing the graph used by HDBSCAN* with a much smaller graph that is guaranteed to contain the required information. An extensive experimental evaluation shows that with our approach one can obtain over one hundred hierarchies for a computational cost equivalent to running HDBSCAN* about twice.
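To make the cost of the per-value baseline concrete, the following is a minimal sketch of the naive procedure the abstract describes as inefficient: for each mpts in a range, build a minimum spanning tree over the complete mutual reachability graph. It is not the approach proposed in the paper, and it stops at the spanning tree rather than extracting a hierarchy. All function names are hypothetical, and the sketch assumes Euclidean distance and the common convention that the core distance of a point is the distance to its mpts-th nearest neighbor, counting the point itself.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix (O(n^2) memory; illustration only)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def core_distances(dist, mpts):
    """Distance to the mpts-th nearest neighbor, counting the point itself
    (an assumed convention; some formulations exclude the point)."""
    return [sorted(row)[mpts - 1] for row in dist]


def mutual_reachability_mst(dist, core):
    """Prim's algorithm on the complete mutual reachability graph.

    The edge weight between i and j is max(core[i], core[j], dist[i][j]).
    Returns a list of (parent, child, weight) tuples with n - 1 entries.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax connections from u to every vertex still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                w = max(core[u], core[v], dist[u][v])
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges


def naive_msts_for_range(points, kmax):
    """One independent O(n^2) spanning tree per mpts in 1..kmax (the baseline)."""
    dist = pairwise_distances(points)
    return {
        mpts: mutual_reachability_mst(dist, core_distances(dist, mpts))
        for mpts in range(1, kmax + 1)
    }


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (10.0, 0.0)]
    for mpts, tree in naive_msts_for_range(sample, 3).items():
        print(mpts, tree)
```

Each value of mpts repeats the quadratic spanning-tree computation over the complete graph, so the total work grows linearly with the size of the range. This is the cost the abstract says can be avoided by working on a much smaller graph.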


