Title
An enhanced method of initial cluster center selection for K-means algorithm
Authors
Abstract
Clustering is one of the most widely used techniques for discovering patterns in a dataset that can then be applied in different applications or analyses. K-means, the most popular and simplest clustering algorithm, is conventionally initialized at random and may become trapped in local minima if not properly initialized. In this paper, we propose a novel approach to improve the initial cluster center selection for the K-means algorithm. The approach is based on the fact that the initial centroids must be well separated from each other, since the final clusters form separated groups in feature space. A convex hull algorithm is used to compute the first two centroids, and the remaining ones are selected according to their distance from the previously selected centers. To ensure that one center is selected per cluster, we use a nearest-neighbor technique. To check the robustness of the proposed algorithm, we consider several real-world datasets. We obtained clustering errors of only 7.33%, 7.90%, and 0% on the Iris, Letter, and Ruspini data respectively, which demonstrates better performance than other existing systems. The results also indicate that the proposed method outperforms the conventional K-means approach by accelerating the computation when the number of clusters is greater than 2.
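The abstract only outlines the initialization scheme, so the following is a minimal sketch of one plausible reading of it, not the authors' reference implementation. It assumes that the first two centroids are the pair of convex-hull vertices farthest apart, that each remaining centroid is the data point maximizing its minimum distance to the centers chosen so far, and that the nearest-neighbor step replaces each chosen point with the mean of its nearest neighbors so the centroid lies inside a dense region; the function name init_centers and the parameter n_neighbors are illustrative choices, not from the paper.

# Sketch of the described initialization (interpretation of the abstract, not the authors' code).
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

def init_centers(X, n_clusters, n_neighbors=5):
    """Return n_clusters initial centroids for K-means on data X of shape (n_samples, n_features)."""
    # First two centroids: the most distant pair of convex-hull vertices (assumption).
    hull_pts = X[ConvexHull(X).vertices]
    d = cdist(hull_pts, hull_pts)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    centers = [hull_pts[i], hull_pts[j]]
    # Remaining centroids: farthest-point (maximin) selection from the centers chosen so far.
    while len(centers) < n_clusters:
        dist_to_centers = cdist(X, np.vstack(centers)).min(axis=1)
        centers.append(X[np.argmax(dist_to_centers)])
    # Nearest-neighbor smoothing so each centroid sits in a dense region of its cluster (assumption).
    smoothed = []
    for c in centers:
        nn_idx = np.argsort(np.linalg.norm(X - c, axis=1))[:n_neighbors]
        smoothed.append(X[nn_idx].mean(axis=0))
    return np.vstack(smoothed)

The returned array could, for example, be passed to scikit-learn's KMeans through its init parameter (with n_init=1), so that K-means starts from these centers instead of a random initialization.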