快速而急切的K-Medoids聚类：O（K）PAM，Clara和Clarans算法的运行时间改进

论文标题

快速而急切的K-Medoids聚类：O（K）PAM，Clara和Clarans算法的运行时间改进

Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms

论文作者

Schubert, Erich, Rousseeuw, Peter J.

论文摘要

聚类非欧几里得数据很困难，除了层次聚类之外，最常用的算法之一是围绕MEDOIDS（PAM）的流行算法分区，也简单地称为K-Medoids群集。在欧几里得的几何形状中，k均值中使用的平均值 - 是群集中心的良好估计器，但这并不存在于任意差异。 PAM使用Medoid，而不是与群集中所有其他相似性最小的对象。这种中心性的概念可以与任何（差异）相似性一起使用，因此与许多域和应用具有很高的相关性。 PAM的关键问题是其高运行时间成本。我们建议对PAM算法进行修改，该算法在该算法的第二个（“ swap”）阶段实现O（k） - 折叠的速度，但仍然会发现与原始PAM算法相同的结果。如果我们放松执行掉期的选择（在保持可比质量的同时），则可以通过在每次迭代中急切地执行其他掉期来进一步加速该算法。通过实质上更快的互换，我们现在可以探索更快的初始化策略，因为（i）经典（构建）初始化现在成为瓶颈，并且（ii）我们的掉期足够快以补偿较差的起始条件。我们还展示了Clara和Clarans算法如何从拟议的修改中受益。尽管我们不研究这项工作中方法的并行化，但可以很容易地将其与早期使用PAM和Clara在大数据上使用的方法相结合（其中一些将PAM用作子例程，因此可以立即从这些改进中受益），在这些改进中，高k的性能变得越来越重要。在对k = 100,200的实际数据实验中，我们观察到与原始的PAM交换算法相比，分别观察到458倍的1191x速度，使PAM适用于较大的数据集，尤其是较高的K。

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100,200, we observed a 458x respectively 1191x speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.

下载PDF全文

下载文献需遵守相关版权规定

论文标题