Hashing-Based Distributed Clustering for Massive High-Dimensional Data

06/30/2023
by Yifeng Xiao, et al.

Clustering analysis is of substantial significance for data mining. The properties of big data create demand for more efficient and economical distributed clustering methods. However, existing distributed clustering methods mainly address data size and overlook the problems caused by high data dimensionality. To address this gap, we propose a new distributed algorithm, Hashing-Based Distributed Clustering (HBDC). Motivated by the strong performance of hashing methods in nearest neighbor search, the algorithm applies the learning-to-hash technique to the clustering problem, where compact hash codes reduce the cost of data storage, transmission, and computation. Following a global-sub-site paradigm, HBDC consists of two stages: distributed training of a hashing network, and spectral clustering of the hash codes at the global site. The sub-sites use the learnable network as a hash function to convert massive high-dimensional original data into a small number of hash codes, and send them to the global site for final clustering. In addition, a sample-selection method and lightweight network structures are designed to accelerate the convergence of the hash network. We also analyze the transmission cost of HBDC, including its upper bound. Our experiments on synthetic and real datasets show that HBDC outperforms existing state-of-the-art algorithms.
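The sub-site step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a fixed random-hyperplane hash stands in for the learned hash network, and the function names, the 8-bit code length, the 32-bit count field, and the float32 storage are all assumptions made for illustration. The sketch only shows why sending distinct hash codes instead of raw high-dimensional vectors keeps transmission small: with b-bit codes a sub-site can hold at most 2^b distinct codes. That is a simple counting bound and may differ from the bound derived in the paper.

```python
import numpy as np

def make_hash_function(dim, n_bits, seed=0):
    """Random-hyperplane hash: an assumed stand-in for the learned hash network."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((dim, n_bits))

    def hash_fn(x):
        # The sign of each projection gives one bit per hyperplane.
        return (x @ planes > 0).astype(np.uint8)

    return hash_fn

def sub_site_encode(data, hash_fn):
    """Hash local data and keep only the distinct codes with their counts."""
    codes = hash_fn(data)
    unique_codes, counts = np.unique(codes, axis=0, return_counts=True)
    return unique_codes, counts

def transmission_bits(unique_codes, count_bits=32):
    """Bits sent to the global site: one code plus one count per distinct code."""
    n_codes, n_bits = unique_codes.shape
    return n_codes * (n_bits + count_bits)

rng = np.random.default_rng(42)
dim, n_bits, n_points = 512, 8, 10_000
hash_fn = make_hash_function(dim, n_bits)  # shared by all sub-sites

# Two hypothetical sub-sites, each holding its own shard of data.
shards = [rng.standard_normal((n_points, dim)) for _ in range(2)]
encoded = [sub_site_encode(shard, hash_fn) for shard in shards]

raw_bits = sum(shard.size * 32 for shard in shards)  # float32 storage assumed
sent_bits = sum(transmission_bits(codes) for codes, _ in encoded)
print(f"raw: {raw_bits} bits, sent: {sent_bits} bits")
```

In this sketch the global site would receive the `(unique_codes, counts)` pairs from every sub-site and run its clustering step on the codes alone. The clustering step itself is omitted here.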

research
10/15/2020

CIMON: Towards High-quality Hash Codes

Recently, hashing is widely-used in approximate nearest neighbor search ...
research
01/11/2017

Stochastic Generative Hashing

Learning-based binary hashing has become a powerful paradigm for fast se...
research
10/01/2018

Fusion Hashing: A General Framework for Self-improvement of Hashing

Hashing has been widely used for efficient similarity search based on it...
research
10/10/2020

Making Online Sketching Hashing Even Faster

Data-dependent hashing methods have demonstrated good performance in var...
research
05/05/2019

Fast communication-efficient spectral clustering over distributed data

The last decades have seen a surge of interests in distributed computing...
research
09/04/2015

CNN Based Hashing for Image Retrieval

Along with data on the web increasing dramatically, hashing is becoming ...
research
10/22/2018

Norm-Range Partition: A Univiseral Catalyst for LSH based Maximum Inner Product Search (MIPS)

Recently, locality sensitive hashing (LSH) was shown to be effective for...