在运行长度的HIFI读取中计算所有VS-ALL MEMS

论文标题

在运行长度的HIFI读取中计算所有VS-ALL MEMS

Computing all-vs-all MEMs in run-length encoded collections of HiFi reads

论文作者

Díaz-Domínguez, Diego, Puglisi, Simon J., Salmela, Leena

论文摘要

我们描述了一种算法，以在HIFI读取具有均聚物误差的HIFI读取中最大精确匹配（MEMS）。我们工作中的主要新颖性是我们求助于运行的压缩以帮助解决错误。我们的方法接收为输入，其中包含HIFI读取的运行长度编码的字符串集合及其反向补充。随后，它将编码分为两个数组，一个将相等符号运行的符号序列存储，另一个存储运行长度。拆分的目的是获取运行符号的BWT并相应地重新排序其长度。我们表明，这种特殊的BWT编码HIFI读取及其反向补充，它支持HIFI读取的双向查询。然后，我们提出了Belazzougui等人的MEM算法的变体。（2013年）利用BWT的运行长度编码和隐式双向属性来计算近似MEM。具体而言，如果该算法发现两个子字节，$ a_1 \ ldots a _p $和$ b_1 \ ldots b_p $，则有一个mem，则仅在其相应的长度序列，$ \ ell^{a} _1 _1 _1 \ ldots \ ell^ldots \ ell^el^a} a} a} a} a} a} a} a} a} _1 \ ell^{b} _p $，不要超出输入阈值。我们使用一个简单的度量来计算称为{\ em run-Length额外}的长度序列的相似性。我们的技术促进了使用均聚物误差的MEMS检测，因为它不需要动态编程即可找到近似匹配的匹配，而唯一的编辑是相等符号运行的长度。最后，我们提出了一种依赖几何数据结构来报告我们算法检测到的MEM的文本的方法。

We describe an algorithm to find maximal exact matches (MEMs) among HiFi reads with homopolymer errors. The main novelty in our work is that we resort to run-length compression to help deal with errors. Our method receives as input a run-length-encoded string collection containing the HiFi reads along with their reverse complements. Subsequently, it splits the encoding into two arrays, one storing the sequence of symbols for equal-symbol runs and another storing the run lengths. The purpose of the split is to get the BWT of the run symbols and reorder their lengths accordingly. We show that this special BWT, as it encodes the HiFi reads and their reverse complements, supports bi-directional queries for the HiFi reads. Then, we propose a variation of the MEM algorithm of Belazzougui et al. (2013) that exploits the run-length encoding and the implicit bi-directional property of our BWT to compute approximate MEMs. Concretely, if the algorithm finds that two substrings, $a_1 \ldots a_p$ and $b_1 \ldots b_p$, have a MEM, then it reports the MEM only if their corresponding length sequences, $\ell^{a}_1 \ldots \ell^{a}_p$ and $\ell^{b}_1 \ldots \ell^{b}_p$, do not differ beyond an input threshold. We use a simple metric to calculate the similarity of the length sequences that we call the {\em run-length excess}. Our technique facilitates the detection of MEMs with homopolymer errors as it does not require dynamic programming to find approximate matches where the only edits are the lengths of the equal-symbol runs. Finally, we present a method that relies on a geometric data structure to report the text occurrences of the MEMs detected by our algorithm.

下载PDF全文

下载文献需遵守相关版权规定

论文标题