Paper Title

UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets

Authors

Yuta Hozumi, Rui Wang, Changchuan Yin, Guo-Wei Wei

Abstract

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets of SARS-CoV-2 genome isolates.
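To make the Jaccard distance-based $k$-means idea in the abstract concrete, here is a minimal standard-library sketch. It is not the authors' implementation. It assumes each genome isolate is represented as a set of mutation positions and that each isolate's feature vector is its row of Jaccard distances to all isolates. The toy mutation sets, the function names, and the farthest-point initialization (swapped in for the usual random or k-means++ seeding to keep the example deterministic) are all hypothetical. The dimension-reduction step (PCA, t-SNE, or UMAP) is omitted here; it would be applied to the feature matrix before clustering.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def jaccard_features(samples):
    """Represent each sample by its Jaccard distances to every sample."""
    return [[jaccard_distance(s, t) for t in samples] for s in samples]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def init_centroids(points, k):
    """Farthest-point seeding (an assumption made for determinism)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: each set lists mutation positions of one isolate.
samples = [
    {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3},
    {10, 11}, {10, 11, 12}, {10, 11, 13},
]
features = jaccard_features(samples)
labels = kmeans(features, k=2)
print(labels)
```

In a fuller pipeline, `features` would first be passed through a dimension-reduction method, for example `umap.UMAP(n_components=2).fit_transform(features)` from the third-party umap-learn package, and `kmeans` would then run on the reduced coordinates.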
