Paper title
Impossibility of phylogeny reconstruction from $k$-mer counts
Paper authors
Paper abstract
We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts over the full leaf sequences alone. Formally, we establish that the joint distributions of $k$-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires more sophisticated use of $k$-mer count information, such as the block techniques developed in previous theoretical work.
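To make the object of study concrete, the following is a minimal sketch of what a $k$-mer count vector over a full two-state sequence might look like. The function name `kmer_counts`, the encoding of the two states as the characters `0` and `1`, and the use of overlapping windows are assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> dict:
    """Count overlapping length-k windows in a sequence over the alphabet {0, 1}.

    Hypothetical helper for illustration only; the paper's exact definition
    of a k-mer count (e.g. overlapping vs. non-overlapping) may differ.
    """
    if k < 1 or k > len(seq):
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    if set(seq) - {"0", "1"}:
        raise ValueError("sequence must contain only the states '0' and '1'")
    # Slide a window of width k across the sequence and tally each word.
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report all 2**k possible words so the result is a fixed-length vector.
    return {"".join(w): seen.get("".join(w), 0) for w in product("01", repeat=k)}


# Two different sequences can share the same count vector: the counts
# discard positional information that the full sequences retain.
same_for_k1 = kmer_counts("0011", 1) == kmer_counts("0101", 1)
```

The count vector has fixed dimension $2^k$ regardless of sequence length, which gives some intuition for why counts taken over the entire sequence summarize it so coarsely.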