Paper title
Model-free prediction of partially observable spatiotemporal chaotic systems
Tight basis cycle representatives for persistent homology of large data sets
Authors
Abstract
Persistent homology (PH) is a popular tool for topological data analysis that has found applications across diverse areas of research. It provides a rigorous method to compute robust topological features in discrete experimental observations, which often contain various sources of uncertainty. Although powerful in theory, PH suffers from a high computational cost that precludes its application to large data sets. Additionally, most analyses using PH are limited to detecting the existence of nontrivial features. Precise localization of these features is not generally attempted because, by definition, localized representations are not unique and because the computational cost is even higher. For scientific applications, such a precise location is a sine qua non for determining functional significance. Here, we provide a strategy and algorithms to compute tight representative boundaries around nontrivial robust features in large data sets. To showcase the efficiency of our algorithms and the precision of the computed boundaries, we analyze three data sets from different scientific fields. In the human genome, we found that impairing chromatin loop formation has an unexpected effect on loops through chromosome 13 and the sex chromosomes. In a distribution of galaxies in the universe, we found statistically significant voids. In protein homologs with significantly different topology, we found voids attributable to ligand interaction, mutation, and differences between species.