论文标题
使用字符串压缩的重复文本的BWT有效构造
Efficient Construction of the BWT for Repetitive Text Using String Compression
论文作者
论文摘要
我们提出了一种新的半外部算法,该算法构建了洞穴 - Bauer等人的旋转器变换变体。 (又称BCR BWT)在线性预期时间。当输入大规模和重复性时,我们的方法使用压缩技术来降低计算成本。具体而言,我们以诱导的后缀排序(ISS)为基础,并诉诸于运行长度和语法压缩,以保持紧凑的形式的中间结果。我们的压缩格式不仅可以节省空间,还可以加快所需的计算。当文本重复时,我们的实验显示了重要的空间和计算时间节省。在中等大小的真实人基因组组件(14.2 GB-75.05 GB)中,我们的内存峰平均比最先进的BCR BWT BWT构造算法(\ texttt {ropebwt2})的峰值小1.7倍,同时运行5倍。我们当前的实现还能够使用118.83 GB的工作记忆(约为输入大小的10 \%)计算41.21小时的400个真实人类基因组组件(1.2 tb)的BCR BWT。有趣的是,我们在1.2 TB文件中报告的结果主要由在内存约束(特别是I/O操作)下扫描巨大文件的困难。这个事实表明,通过更仔细的方法实现我们可以表现得更好,从而将其扩展到更大的尺寸。
We present a new semi-external algorithm that builds the Burrows--Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space but also speeds up the required computations. Our experiments show important space and computation time savings when the text is repetitive. In moderate-size collections of real human genome assemblies (14.2 GB - 75.05 GB), our memory peak is, on average, 1.7x smaller than the peak of the state-of-the-art BCR BWT construction algorithm (\texttt{ropebwt2}), while running 5x faster. Our current implementation was also able to compute the BCR BWT of 400 real human genome assemblies (1.2 TB) in 41.21 hours using 118.83 GB of working memory (around 10\% of the input size). Interestingly, the results we report in the 1.2 TB file are dominated by the difficulties of scanning huge files under memory constraints (specifically, I/O operations). This fact indicates we can perform much better with a more careful implementation of our method, thus scaling to even bigger sizes efficiently.