Paper title

Enhancing cluster analysis via topological manifold learning

Authors

Moritz Herrmann, Daniyal Kazempour, Fabian Scheipl, Peer Kröger

Abstract

We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: theoretical arguments and empirical evidence show that clustering embedding vectors, which represent the structure of a data manifold, instead of the observed feature vectors themselves is highly beneficial. To demonstrate, we combine the manifold learning method UMAP, used to infer the topological structure, with the density-based clustering method DBSCAN. Results on synthetic and real data show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems, including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces the parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. Our approach is successful because we perform the cluster analysis after projecting the data into a more suitable space that is, in some sense, optimized for separability.
