Wasserstein K-means for clustering probability distributions

09/14/2022
by   Yubo Zhuang, et al.

Clustering is an important exploratory data analysis technique for grouping objects based on their similarity. The widely used K-means clustering method relies on some notion of distance to partition data into a smaller number of groups. In Euclidean space, the centroid-based and distance-based formulations of K-means are equivalent. In modern machine learning applications, data often arise as probability distributions, and a natural generalization for handling measure-valued data is to use the optimal transport metric. Due to the non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may cause the centroid-based formulation to fail to represent the within-cluster data points, while the more direct distance-based K-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP-relaxed Wasserstein K-means can achieve exact recovery provided that the clusters are well separated under the 2-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based K-means can achieve better classification performance than the standard centroid-based K-means for clustering probability distributions and images.
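To make the distance-based formulation concrete for the Gaussian case, the following is a minimal Python sketch, not the authors' implementation. It uses the well-known closed form of the squared 2-Wasserstein distance between two Gaussians, W2^2 = ||m1 - m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}), and evaluates a pairwise-distance K-means objective for a given labeling. The function names (`psd_sqrt`, `gaussian_w2_squared`, `pairwise_w2_squared`, `distance_based_objective`) are hypothetical, and the 1/(2|G_k|) normalization is an assumption borrowed from the Euclidean identity between the centroid and pairwise forms; the abstract does not state the exact normalization.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    sym = (mat + mat.T) / 2.0           # remove small numerical asymmetry
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)     # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(m1, S1, m2, S2):
    """Closed-form squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    root1 = psd_sqrt(S1)
    cross = psd_sqrt(root1 @ S2 @ root1)
    mean_term = np.sum((np.asarray(m1) - np.asarray(m2)) ** 2)
    cov_term = np.trace(S1 + S2 - 2.0 * cross)
    return float(mean_term + cov_term)


def pairwise_w2_squared(means, covs):
    """Matrix of squared W2 distances between all pairs of Gaussians."""
    n = len(means)
    D2 = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D2[i, j] = D2[j, i] = gaussian_w2_squared(means[i], covs[i], means[j], covs[j])
    return D2


def distance_based_objective(D2, labels):
    """Sum over clusters of within-cluster pairwise squared distances / (2 * cluster size).

    Assumed normalization: in Euclidean space this equals the usual
    centroid-based K-means cost, but here no barycenter is ever computed.
    """
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        total += D2[np.ix_(idx, idx)].sum() / (2.0 * len(idx))
    return total
```

Because the objective depends only on the pairwise distance matrix, candidate labelings can be compared without constructing Wasserstein barycenters, which is the property the distance-based approach relies on.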


