Linear Time Clustering for High Dimensional Mixtures of Gaussian Clouds
Clustering mixtures of Gaussian distributions is a fundamental and challenging problem that arises in many high-dimensional data processing tasks. In this paper, we propose a novel and efficient clustering algorithm for n points drawn from a mixture of two Gaussian distributions in R^p. The algorithm performs random 1-dimensional projections until it finds a direction that yields the user-specified clustering error e. For a 1-dimensional separability parameter γ satisfying γ=Q^-1(e), the expected number of such projections is shown to be bounded by o( p), when γ satisfies γ≤ cp, where c is the separability parameter of the two Gaussians in R^p. It is shown that the square of the 1-dimensional separability resulting from a random projection is in expectation equal to c^2, which guarantees a small number of projections in realistic scenarios. Consequently, the expected overall running time of the algorithm is linear in n and quasi-linear in p. This result stands in contrast to prior works, which learn the parameters of the Gaussian mixture model and have running times that are polynomial, or at best quadratic, in p and n. The new scheme is particularly appealing in the challenging setting where the ambient dimension of the data, p, is very large and the number of sample points, n, is small or of the same order as p. We show that the bound on the expected number of 1-dimensional projections extends to mixtures of three or more Gaussian distributions. Finally, we validate these results with numerical experiments in which the proposed algorithm is shown to perform within the prescribed accuracy and running time bounds.
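The projection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 2-means threshold used to split each projection, and the empirical separability estimate |m1 - m2| / (s1 + s2) are all assumptions made for illustration. Only the overall structure (project onto a random direction, check the 1-dimensional separability against γ = Q^-1(e), repeat) follows the text.

```python
import numpy as np
from statistics import NormalDist


def split_1d(z, iters=20):
    """Split 1-D values into two groups with a 2-means threshold (assumed rule)."""
    t = z.mean()
    for _ in range(iters):
        lo, hi = z[z <= t], z[z > t]
        if lo.size == 0 or hi.size == 0:
            break
        t_new = 0.5 * (lo.mean() + hi.mean())
        if np.isclose(t_new, t):
            break
        t = t_new
    return z > t


def separability_1d(z, labels):
    """Crude empirical separability: gap between means over the sum of std devs."""
    a, b = z[labels], z[~labels]
    if a.size < 2 or b.size < 2:
        return 0.0
    return abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-12)


def cluster_by_random_projections(X, e, max_projections=1000, seed=0):
    """Project onto random unit directions until the estimated 1-D
    separability reaches gamma = Q^{-1}(e). Returns (labels, projections used)."""
    rng = np.random.default_rng(seed)
    gamma = NormalDist().inv_cdf(1.0 - e)  # Q^{-1}(e) for the Gaussian tail Q
    n, p = X.shape
    best_score, best_labels = -1.0, np.zeros(n, dtype=bool)
    for k in range(1, max_projections + 1):
        v = rng.standard_normal(p)
        v /= np.linalg.norm(v)      # random direction on the unit sphere
        z = X @ v                   # one projection costs O(n p)
        labels = split_1d(z)
        score = separability_1d(z, labels)
        if score > best_score:
            best_score, best_labels = score, labels
        if score >= gamma:          # assumed stopping rule
            return labels, k
    # No direction met the target: fall back to the best one seen.
    return best_labels, max_projections
```

Each iteration costs O(np) for the projection, so the total cost is that figure times the number of projections tried, which is where the abstract's bound on the expected number of projections enters. The separability estimate here is computed from thresholded groups and is therefore biased for weakly separated data, so this sketch is only meaningful for small target errors e.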