Paper title
Intrinsic dimension estimation for discrete metrics
Paper authors
Paper abstract
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensionality reduction methods are designed for continuous spaces, and their use on discrete spaces can lead to errors and biases. In this Letter, we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
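To make the notion of an intrinsic dimension on a discrete space concrete, the following is a minimal illustrative sketch and not the algorithm of the Letter, whose details the abstract does not give. It assumes an integer lattice with the L1 (Manhattan) metric and uses the known count of lattice points in an L1 ball of integer radius t in d dimensions, V(t, d) = sum_k 2^k C(d, k) C(t, k). The function names `l1_ball_volume` and `estimate_id`, the radii, and the toy data are all hypothetical choices made for this illustration.

```python
from itertools import product
from math import comb


def l1_ball_volume(t: int, d: int) -> int:
    """Number of points of the integer lattice Z^d within L1 distance t of a point."""
    return sum(2**k * comb(d, k) * comb(t, k) for k in range(min(d, t) + 1))


def estimate_id(points, center, t1: int, t2: int, max_d: int = 20) -> int:
    """Return the integer d whose ratio V(t1, d) / V(t2, d) best matches
    the observed ratio of neighbor counts around `center` (illustrative only)."""

    def l1(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    n = sum(1 for p in points if l1(p, center) <= t1)  # inner-ball count (includes center)
    k = sum(1 for p in points if l1(p, center) <= t2)  # outer-ball count (includes center)
    observed = n / k
    return min(
        range(1, max_d + 1),
        key=lambda d: abs(l1_ball_volume(t1, d) / l1_ball_volume(t2, d) - observed),
    )


# Toy data (assumption): a 2D lattice patch embedded in a 5-dimensional discrete space.
# The last three coordinates are constant, so the intrinsic dimension is 2.
points = [(x, y, 0, 0, 0) for x, y in product(range(-6, 7), repeat=2)]
center = (0, 0, 0, 0, 0)
print(estimate_id(points, center, t1=2, t2=4))  # prints 2
```

In this toy case the data are embedded in five dimensions, but the neighbor counts grow like those of a two-dimensional lattice, so the estimate is 2. Real data are noisy and non-uniform, so a practical estimator would need to average over many points and treat the counts statistically.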