Clustering Perturbation Resilient Instances
Euclidean k-means is NP-hard in the worst case but is often solved efficiently by simple heuristics in practice. This has led researchers to study properties of real-world data sets that give rise to stable optimal clusters and to simple, provably efficient algorithms for recovering them. We consider stable instances of Euclidean k-means that admit provable polynomial-time algorithms for recovering the optimal clustering. Existing results of this kind often either rely on assumptions about the data that do not hold in practice, or yield algorithms that are not practical or stable enough, with running time quadratic or worse in the number of points. We propose simple algorithms, with running time linear in the number of points and in the dimension, that provably recover the optimal clustering on α-metric perturbation resilient instances of Euclidean k-means. Our results hold even when the instances satisfy only α-center proximity, a weaker property that is implied by α-metric perturbation resilience. When the data contains a certain class of outliers (and only the inliers satisfy the α-center proximity property), we give an algorithm that outputs a small list of clusterings, one of which is guaranteed to be the optimal clustering.
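The abstract does not define α-center proximity. Under the definition commonly used in the perturbation-resilience literature, which is assumed here, a clustering satisfies it when every point is more than α times farther from every other cluster's center than from its own. The following minimal sketch tests that condition for a given labeling; the function names `centroid` and `satisfies_center_proximity` are hypothetical, and this is not one of the paper's algorithms.

```python
import math

def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(pts[0])
    return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

def satisfies_center_proximity(points, labels, alpha):
    """Check the (assumed) alpha-center proximity condition.

    For every point x in cluster i with mean mu_i, and every other
    cluster j, require dist(x, mu_j) > alpha * dist(x, mu_i).
    """
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    # For k-means, each cluster's center is its mean.
    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    for p, lab in zip(points, labels):
        own = math.dist(p, centers[lab])
        for other, c in centers.items():
            if other != lab and math.dist(p, c) <= alpha * own:
                return False
    return True

# Two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(satisfies_center_proximity(points, labels, alpha=2.0))
```

The check only verifies the condition for the labeling it is given. It says nothing about whether that labeling is the optimal k-means clustering, which is the clustering the property is stated for.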