Paper Title

Escaping the curse of dimensionality in Bayesian model based clustering

Authors

Chandra, Noirrit Kiran, Canale, Antonio, Dunson, David B.

Abstract

Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probabilities, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) on a set of low-dimensional latent variables inducing a partition on the observed data. The model is amenable to scalable posterior inference and we show that it can avoid the pitfalls of high-dimensionality under mild assumptions. The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq.
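The degeneracy described in the abstract can be illustrated with a toy computation (a minimal sketch, not the paper's Lamb model; it assumes a conjugate Gaussian kernel `N(mu, I)` with prior `mu ~ N(0, tau2 * I)`, and `tau2` is an illustrative choice). Because the dimensions contribute additively to each partition's log marginal likelihood, the gap between any two partitions grows roughly linearly in the dimension `p`, eventually swamping any fixed clustering prior — which is why the posterior can collapse to one cluster or to all-singletons as `p` grows.

```python
import numpy as np

def log_marginal_cluster(x, tau2=1.0):
    """Log marginal likelihood of one cluster under a conjugate Gaussian model.

    x : (m, p) array of m observations in p independent dimensions,
        with kernel N(mu, I) and prior mu ~ N(0, tau2 * I).
    """
    m, p = x.shape
    s = x.sum(axis=0)            # per-dimension sums
    ss = (x ** 2).sum(axis=0)    # per-dimension sums of squares
    per_dim = (-0.5 * m * np.log(2 * np.pi)
               - 0.5 * np.log(m * tau2 + 1.0)
               - 0.5 * (ss - tau2 * s ** 2 / (m * tau2 + 1.0)))
    return per_dim.sum()

def log_marginal_partition(X, labels, tau2=1.0):
    """Sum of cluster log marginals; adding a fixed log prior gives the
    unnormalized log posterior of the partition."""
    return sum(log_marginal_cluster(X[labels == c], tau2)
               for c in np.unique(labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 6
    for p in (10, 100, 1000):
        X = rng.normal(size=(n, p))
        X[: n // 2] += 2.0                      # two well-separated groups
        one = np.zeros(n, dtype=int)            # everyone in one cluster
        singletons = np.arange(n)               # everyone in their own cluster
        gap = (log_marginal_partition(X, one)
               - log_marginal_partition(X, singletons))
        # the per-dimension gap is roughly constant, so the total gap
        # grows linearly in p and dominates any fixed partition prior
        print(p, gap / p)
```

The sign of the gap depends on the kernel and prior scale, matching the abstract's point that the posterior is pushed toward one extreme partition or the other regardless of the clustering prior, as long as that prior is fixed in `p`.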
