The Area Under the ROC Curve as a Measure of Clustering Quality

Paper Title

The Area Under the ROC Curve as a Measure of Clustering Quality

Authors

Pablo Andretta Jaskowiak, Ivan Gesteira Costa, Ricardo José Gabrielli Barreto Campello

Abstract

The Area Under the Receiver Operating Characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider it, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to the effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective.
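The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each pair of objects is treated as one "instance", that a pair counts as positive when both objects fall in the same cluster, and that the pair's score is its negated Euclidean distance, so closer pairs rank higher. The function names `aucc` and `gamma_from_aucc` are hypothetical. The relation Gamma = 2 * AUCC - 1 is the usual link between AUC and a concordance statistic, and is assumed here to be the linear transformation the abstract refers to; it holds exactly only when no within-cluster distance ties with a between-cluster distance.

```python
import math
from itertools import combinations


def aucc(points, labels):
    """AUC over object pairs (illustrative sketch, not the paper's code).

    Positives are same-cluster pairs; a pair's score is its negated
    Euclidean distance. The AUC is computed from the rank sum of the
    positives (Mann-Whitney form), which needs one sort over the pairs.
    """
    pairs = list(combinations(range(len(points)), 2))
    dist = [math.dist(points[i], points[j]) for i, j in pairs]
    same = [labels[i] == labels[j] for i, j in pairs]

    n_pos = sum(same)
    n_neg = len(same) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both same-cluster and different-cluster pairs")

    # Rank pairs by ascending score, i.e. descending distance.
    order = sorted(range(len(dist)), key=lambda k: -dist[k])
    ranks = [0.0] * len(dist)
    k = 0
    while k < len(order):
        m = k
        # Extend the run of pairs tied at the same distance.
        while m + 1 < len(order) and dist[order[m + 1]] == dist[order[k]]:
            m += 1
        avg_rank = (k + m) / 2 + 1  # average 1-based rank for the tie group
        for t in range(k, m + 1):
            ranks[order[t]] = avg_rank
        k = m + 1

    rank_sum = sum(r for r, is_same in zip(ranks, same) if is_same)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def gamma_from_aucc(value):
    """Assumed linear map from AUCC to Gamma (exact only without ties)."""
    return 2 * value - 1


# Two well-separated groups of two points each (made-up toy data).
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = aucc(points, [0, 0, 1, 1])  # clusters match the groups
bad = aucc(points, [0, 1, 0, 1])   # clusters cut across the groups
print(good, bad, gamma_from_aucc(good))
```

A partition that matches the two groups scores higher than one that cuts across them. The sketch enumerates all n(n-1)/2 pairs, so it is only practical for small datasets.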
