Paper title
DCMD: Distance-based Classification Using Mixture Distributions on Microbiome Data
Paper authors
Abstract
Advances in next-generation sequencing have allowed researchers to study the microbiome and human disease comprehensively, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, the structure of microbiome data, characterized by sparsity and skewness, makes it challenging to build effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance on microbiome community data, where the predictors are sparse and heterogeneous counts. It models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on the observed counts and the estimated mixture; these distributions are then used as inputs for distance-based classification. The method is implemented within k-means and k-nearest neighbours frameworks, and we identify two distance metrics that produce optimal results. The performance of the model is assessed using simulations, and the method is applied to a human microbiome study, with results compared against a number of existing machine learning and distance-based approaches. The proposed method is competitive with the machine learning approaches and shows a clear improvement over commonly used distance-based classifiers. Its range of applicability and robustness make it a viable alternative for classification using sparse microbiome count data.
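The pipeline described in the abstract (fit a mixture, represent each observation as a distribution, classify by distance) can be illustrated with a minimal sketch. The abstract does not name the mixture family or the two distance metrics, so everything specific here is an assumption made for illustration: a two-component Poisson mixture per taxon fitted by EM, each count represented by its posterior probabilities over the mixture components, Euclidean distance between those representations, and a plain k-nearest neighbours vote. All function names are hypothetical and the data are simulated toy counts, not results from the paper.

```python
import numpy as np
from math import lgamma

def poisson_logpmf(x, rates):
    """Log Poisson pmf of counts x (n,) under each rate (K,), returned as (n, K)."""
    x = np.asarray(x, dtype=float)[:, None]
    rates = np.asarray(rates, dtype=float)[None, :]
    log_factorial = np.vectorize(lgamma)(x + 1.0)
    return x * np.log(rates) - rates - log_factorial

def responsibilities(x, weights, rates):
    """Posterior probability of each mixture component given each observed count."""
    log_post = np.log(weights)[None, :] + poisson_logpmf(x, rates)
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def fit_poisson_mixture(x, n_components=2, n_iter=200):
    """EM for a finite Poisson mixture on one taxon's counts (assumed model)."""
    x = np.asarray(x, dtype=float)
    # Spread the initial rates over the observed range; +0.5 keeps log(rate) finite.
    rates = np.linspace(x.min(), x.max(), n_components) + 0.5
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        resp = responsibilities(x, weights, rates)   # E-step
        mass = resp.sum(axis=0) + 1e-12              # M-step
        weights = mass / mass.sum()
        rates = np.maximum(resp.T @ x / mass, 1e-6)
    return weights, rates

def embed(X, mixtures):
    """Represent each sample by its concatenated per-taxon posterior vectors."""
    blocks = [responsibilities(X[:, j], w, r) for j, (w, r) in enumerate(mixtures)]
    return np.hstack(blocks)

def knn_predict(train_repr, train_labels, test_repr, k=3):
    """Majority-vote k-NN with Euclidean distance (an assumed metric)."""
    diffs = test_repr[:, None, :] - train_repr[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(train_labels[row]).argmax() for row in nearest])

# Toy data: class 0 has sparse, low counts; class 1 has higher counts.
rng = np.random.default_rng(0)
n_per_class, n_taxa = 40, 5
X = np.vstack([rng.poisson(0.3, size=(n_per_class, n_taxa)),
               rng.poisson(6.0, size=(n_per_class, n_taxa))])
y = np.repeat([0, 1], n_per_class)

order = rng.permutation(len(y))
train, test = order[:60], order[60:]

# Mixtures are estimated on the training samples only, one per taxon.
mixtures = [fit_poisson_mixture(X[train, j]) for j in range(n_taxa)]
train_repr = embed(X[train], mixtures)
test_repr = embed(X[test], mixtures)

pred = knn_predict(train_repr, y[train], test_repr, k=3)
accuracy = float((pred == y[test]).mean())
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

In this sketch, the representation replaces a raw count with a probability vector, so two small counts that are equally plausible under the same component end up close together even when their raw values differ. Swapping in a different mixture family or distance only requires changing `poisson_logpmf` or the norm inside `knn_predict`.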