论文标题

共发生问题的复杂性

The Complexity of the Co-Occurrence Problem

论文作者

Bille, Philip, Gørtz, Inge Li, Stordalen, Tord

论文摘要

让$ s $是字母$σ$上的长度$ n $的字符串,让$ q $是$ q $ q \ geq 2 $的$σ$的子集。 “共发生问题”是构建一个支持以下查询的紧凑数据结构:给定整数$ w $返回$ w $ s $的长度 - $ w $ s $的$ s $,其中包含$ q $的每个字符至少一次。这是一个自然的弦问题,应用程序,例如数据挖掘,自然语言处理和DNA分析。艺术的状态是$ o(\ sqrt {nq})$空间数据结构,该结构(随着一些次要添加)支持$ o(\ log \ log \ log n)$ time中的查询[CPM 2021]。 我们的贡献如下。首先,我们根据新的自然参数$ d $分析了问题,提供了一个简单的数据结构,该数据结构使用$ o(d)$ space并支持$ o(\ log \ log \ log n)$ time中的查询。预处理算法在$ s $上进行一次通行证,以预期的$ O(n)$时间运行,并在输入中使用$ o(d)$ space。此外,我们表明$ o(d)$ space是最佳的,并且$ o(\ log \ log n)$ - 时间查询是最佳给定最佳空间的最佳时间。其次,我们绑定了$ d = o(\ sqrt {nq})$,根据$ n $和$ n $和$ q $的干净界限,与最新状态相匹配。此外,我们证明$ω(\ sqrt {nq})$位是最坏的空间,这意味着$ o(\ sqrt {nq})$上限紧密到polygarithminmic因子内。我们所有的结果均基于简单和直观的组合思想,这些想法简化了技术的状态。

Let $S$ be a string of length $n$ over an alphabet $Σ$ and let $Q$ be a subset of $Σ$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$ return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an $O(\sqrt{nq})$ space data structure that -- with some minor additions -- supports queries in $O(\log\log n)$ time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter $d$, giving a simple data structure that uses $O(d)$ space and supports queries in $O(\log\log n)$ time. The preprocessing algorithm does a single pass over $S$, runs in expected $O(n)$ time, and uses $O(d)$ space in addition to the input. Furthermore, we show that $O(d)$ space is optimal and that $O(\log\log n)$-time queries are optimal given optimal space. Secondly, we bound $d = O(\sqrt{nq})$, giving clean bounds in terms of $n$ and $q$ that match the state of the art. Furthermore, we prove that $Ω(\sqrt{nq})$ bits of space is necessary in the worst case, meaning that the $O(\sqrt{nq})$ upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源