论文标题
非球形混合物的异常值聚类
Outlier-Robust Clustering of Non-Spherical Mixtures
论文作者
论文摘要
我们给出了第一个异常出现的有效算法,用于聚集$ k $统计分离的d维高斯(K-gmms)的混合物。 Concretely, our algorithm takes input an $ε$-corrupted sample from a $k$-GMM and whp in $d^{\text{poly}(k/η)}$ time, outputs an approximate clustering that misclassifies at most $k^{O(k)}(ε+η)$ fraction of the points whenever every pair of mixture components are separated by $ 1- \ exp( - \ text {poly}(k/η)^k)$总变化(电视)距离。即使以$ k = 2 $,这种结果也不知道。电视分离是统计上最弱的分离概念,并捕获了重要的特殊情况,例如混合线性回归和子空间聚类。 我们的主要概念贡献是蒸馏简单的分析特性 - (可认证的)超收缩率和2度多项式的有界方差和线性投影的抗浓度 - 对于混合模型(有效)可簇的必要和足够。结果,我们的结果扩展到了$ d $维单元球体上均匀分布的任意仿射变换的聚类混合物。即使是满足这两个分析假设的分离分布的信息理论凝聚力,在我们工作之前也不知道,并且很可能具有独立的兴趣。 我们的算法建立在最近依赖于可认证的抗浓度的作品的序列上,该作品首先在2019年在Karmarkar,Klivans和Kothari和Raghavendra和Yau中引入。这涉及给出低度的平方总和证明语句的证明,该声明将参数(即平均值和协方差)距离与总变化距离相关联,仅通过依靠超收缩和抗浓缩来提供。
We give the first outlier-robust efficient algorithm for clustering a mixture of $k$ statistically separated d-dimensional Gaussians (k-GMMs). Concretely, our algorithm takes input an $ε$-corrupted sample from a $k$-GMM and whp in $d^{\text{poly}(k/η)}$ time, outputs an approximate clustering that misclassifies at most $k^{O(k)}(ε+η)$ fraction of the points whenever every pair of mixture components are separated by $1-\exp(-\text{poly}(k/η)^k)$ in total variation (TV) distance. Such a result was not previously known even for $k=2$. TV separation is the statistically weakest possible notion of separation and captures important special cases such as mixed linear regression and subspace clustering. Our main conceptual contribution is to distill simple analytic properties - (certifiable) hypercontractivity and bounded variance of degree 2 polynomials and anti-concentration of linear projections - that are necessary and sufficient for mixture models to be (efficiently) clusterable. As a consequence, our results extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the $d$-dimensional unit sphere. Even the information-theoretic clusterability of separated distributions satisfying these two analytic assumptions was not known prior to our work and is likely to be of independent interest. Our algorithms build on the recent sequence of works relying on certifiable anti-concentration first introduced in the works of Karmarkar, Klivans, and Kothari and Raghavendra, and Yau in 2019. Our techniques expand the sum-of-squares toolkit to show robust certifiability of TV-separated Gaussian clusters in data. This involves giving a low-degree sum-of-squares proof of statements that relate parameter (i.e. mean and covariances) distance to total variation distance by relying only on hypercontractivity and anti-concentration.