Paper title
A framework for large-scale phylogenetic analysis
Authors
Abstract
With growing exchanges of people and merchandise between countries, epidemics have become an issue of increasing importance, and huge amounts of data are being collected every day. As a result, analyses that used to run on personal computers are no longer feasible there, and it is now common to run them in high-performance computing (HPC) environments and/or on dedicated systems. At the same time, these analyses typically operate on graphs and trees and run algorithms to find patterns in such structures. Graph-oriented databases and processing systems are therefore a natural fit, yet as far as we know no existing solution relies on these technologies to address the challenges of large-scale phylogenetic analysis. This project aims to develop a modular framework for large-scale phylogenetic analysis that exploits such technologies, namely Neo4j. We propose and develop a framework, delivered as a Neo4j plugin, that represents large phylogenetic networks and trees together with ancillary data, supports queries on such data, and allows the deployment of algorithms for inferring or detecting patterns and for pre-computing visualizations. The framework is innovative and brings several advantages to phylogenetic analysis: storing phylogenetic trees avoids having to compute them again, and using multilayer networks makes comparing them more efficient and scalable. Experimental results show that it can be very efficient for the most frequently used operations and that the running times of the supported algorithms are consistent with their expected time complexity.
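To make the multilayer idea concrete, the following is a minimal in-memory sketch, not the framework's actual Neo4j schema or plugin API. It assumes that each stored tree becomes one layer over a shared set of profile nodes, so that two trees can be compared by intersecting their edge sets instead of recomputing them. The class name MultilayerNetwork, its methods, the layer names, and the node labels p1 to p4 are all hypothetical, and edge-set Jaccard similarity is only one possible comparison measure.

```python
from collections import defaultdict


class MultilayerNetwork:
    """Toy multilayer network: one shared node set, one edge set per layer.

    Hypothetical illustration only; the framework itself stores this
    kind of data in Neo4j, and its real data model may differ.
    """

    def __init__(self):
        self.nodes = set()
        # layer name -> set of undirected edges, each stored as a frozenset
        self.layers = defaultdict(set)

    def add_tree(self, layer, edges):
        """Store a tree, given as (u, v) pairs, as its own layer."""
        for u, v in edges:
            self.nodes.update((u, v))
            self.layers[layer].add(frozenset((u, v)))

    def shared_edges(self, layer_a, layer_b):
        """Return the edges present in both layers."""
        return self.layers[layer_a] & self.layers[layer_b]

    def edge_jaccard(self, layer_a, layer_b):
        """Jaccard similarity of the two layers' edge sets."""
        a, b = self.layers[layer_a], self.layers[layer_b]
        union = a | b
        return len(a & b) / len(union) if union else 1.0


# Two trees over the same four profiles, stored once and then compared.
net = MultilayerNetwork()
net.add_tree("tree_1", [("p1", "p2"), ("p2", "p3"), ("p3", "p4")])
net.add_tree("tree_2", [("p1", "p2"), ("p2", "p4"), ("p3", "p4")])
similarity = net.edge_jaccard("tree_1", "tree_2")
print(similarity)
```

Because both trees live over the same node set, the comparison is a set intersection on already stored edges, which is the kind of saving the abstract attributes to storing trees and using multilayer networks.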