Paper Title

A Novel Incremental Clustering Technique with Concept Drift Detection

Authors

Woodbright, Mitchell D., Rahman, Md Anisur, Islam, Md Zahidul

Abstract

Data are being collected from various aspects of life, and they often arrive in chunks/batches. Traditional static clustering algorithms are not suitable for such dynamic datasets, i.e., when data arrive in streams of chunks/batches. If we re-apply a conventional clustering technique to the combined dataset every time a new batch arrives, the process can be slow and wasteful. Moreover, it can be challenging to store the combined dataset in memory due to its ever-increasing size. As a result, various incremental clustering techniques have been proposed. These techniques need to efficiently update the current clustering result whenever a new batch arrives, so that the current clustering result/solution adapts to the latest data. They also need the ability to detect concept drifts, which occur when the clustering pattern of a new batch is significantly different from that of older batches. Sometimes, clustering patterns may drift temporarily in a single batch while the next batches do not exhibit the drift. Therefore, incremental clustering techniques need the ability to detect both temporary and sustained drifts. In this paper, we propose an efficient incremental clustering algorithm called UIClust. It is designed to cluster streams of data chunks, even when there are temporary or sustained concept drifts. We evaluate the performance of UIClust by comparing it with a recently published, high-quality incremental clustering algorithm. We use real and synthetic datasets. We compare the results by using well-known clustering evaluation criteria: entropy, sum of squared errors (SSE), and execution time. Our results show that UIClust outperforms the existing technique in all our experiments.
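
The two quality criteria named in the abstract, SSE and entropy, can be illustrated with a minimal Python sketch. It assumes the common textbook definitions: SSE as the sum of squared Euclidean distances from each record to its cluster centroid, and entropy as the size-weighted average of the class-label entropy within each cluster (lower is better for both). The abstract does not give the exact formulas used in the paper, so the function names `sse` and `clustering_entropy` and these definitions are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    Assumed textbook definition; not taken from the paper.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid is the per-dimension mean of the cluster's members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def clustering_entropy(cluster_labels, class_labels):
    """Size-weighted average of per-cluster class entropy (log base 2).

    Assumed textbook definition; requires ground-truth class labels.
    """
    n = len(cluster_labels)
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, []).append(cls)

    total = 0.0
    for classes in by_cluster.values():
        size = len(classes)
        counts = Counter(classes)
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


if __name__ == "__main__":
    # Hypothetical toy data: two tight clusters of two points each.
    pts = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0), (5.0, 7.0)]
    assigned = [0, 0, 1, 1]
    print(sse(pts, assigned))
    print(clustering_entropy(assigned, ["a", "a", "b", "b"]))
```

Entropy requires ground-truth class labels, so it applies only to labelled datasets, whereas SSE needs only the records and their cluster assignments.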
