Paper Title

No-substitution k-means Clustering with Adversarial Order

Paper Authors

Bhattacharjee, Robi, Moshkovitz, Michal

Paper Abstract

We investigate $k$-means clustering in the online no-substitution setting when the input arrives in \emph{arbitrary} order. In this setting, points arrive one after another, and the algorithm is required to instantly decide whether to take the current point as a center before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the $k$-means cost. Previous work in this setting assumes that the input's order is random or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive -- it does not include natural inputs generated from mixture models. We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new randomized algorithm and prove that when applied to data with complexity $d$, the algorithm takes $O(d\log(n) k\log(k))$ centers and is an $O(k^3)$-approximation. We also prove that if the data is sampled from a ``natural'' distribution, such as a mixture of $k$ Gaussians, then the new complexity measure is equal to $O(k^2\log(n))$. This implies that for data generated from those distributions, our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a $\text{poly}(k)$-approximation. In terms of negative results, we prove that the number of centers needed to achieve an $\alpha$-approximation is at least $\Omega\left(\frac{d}{k\log(n\alpha)}\right)$.
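To make the online no-substitution setting concrete, the minimal sketch below processes a stream of points one at a time and makes an irrevocable keep-or-discard decision for each point before seeing the next one. The function name `online_no_substitution`, the `scale` parameter, and the distance-threshold decision rule are hypothetical placeholders chosen for illustration; they are not the algorithm analyzed in the paper, which uses the complexity measure $d$ to control how many centers are taken.

```python
# Illustrative sketch of the online no-substitution setting (NOT the paper's
# algorithm): points arrive one at a time, and for each point we must make an
# irrevocable decision to keep or discard it as a center before observing the
# next point. The decision rule -- keep a point with probability that grows
# with its distance to the current centers -- is a generic placeholder.

import math
import random


def online_no_substitution(stream, k, scale=1.0, rng=None):
    """Process `stream` (an iterable of points, each a tuple of floats) online.

    Returns the list of chosen centers. `scale` is a hypothetical tuning
    parameter controlling how eagerly new centers are taken.
    """
    rng = rng or random.Random(0)
    centers = []
    for x in stream:
        if len(centers) < k:
            # Placeholder rule: take the first k points as centers.
            centers.append(x)
            continue
        # Distance from x to its nearest current center.
        d = min(math.dist(x, c) for c in centers)
        # Irrevocable randomized decision: farther points are more likely kept.
        if rng.random() < min(1.0, d / scale):
            centers.append(x)
    return centers


if __name__ == "__main__":
    rng = random.Random(1)
    # Two well-separated Gaussian-like blobs in the plane.
    stream = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    stream += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(200)]
    rng.shuffle(stream)  # arrival order is arbitrary (here: shuffled)
    centers = online_no_substitution(stream, k=2, scale=20.0)
    print(f"took {len(centers)} centers")
```

The key constraint the sketch captures is irrevocability: once a point is passed over it can never be added later, and once taken it is never removed, which is why the trade-off in the abstract is between the number of centers taken and the resulting $k$-means cost.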
