论文标题
共发生问题的复杂性
The Complexity of the Co-Occurrence Problem
论文作者
论文摘要
让$ s $是字母$σ$上的长度$ n $的字符串,让$ q $是$ q $ q \ geq 2 $的$σ$的子集。 “共发生问题”是构建一个支持以下查询的紧凑数据结构:给定整数$ w $返回$ w $ s $的长度 - $ w $ s $的$ s $,其中包含$ q $的每个字符至少一次。这是一个自然的弦问题,应用程序,例如数据挖掘,自然语言处理和DNA分析。艺术的状态是$ o(\ sqrt {nq})$空间数据结构,该结构(随着一些次要添加)支持$ o(\ log \ log \ log n)$ time中的查询[CPM 2021]。 我们的贡献如下。首先,我们根据新的自然参数$ d $分析了问题,提供了一个简单的数据结构,该数据结构使用$ o(d)$ space并支持$ o(\ log \ log \ log n)$ time中的查询。预处理算法在$ s $上进行一次通行证,以预期的$ O(n)$时间运行,并在输入中使用$ o(d)$ space。此外,我们表明$ o(d)$ space是最佳的,并且$ o(\ log \ log n)$ - 时间查询是最佳给定最佳空间的最佳时间。其次,我们绑定了$ d = o(\ sqrt {nq})$,根据$ n $和$ n $和$ q $的干净界限,与最新状态相匹配。此外,我们证明$ω(\ sqrt {nq})$位是最坏的空间,这意味着$ o(\ sqrt {nq})$上限紧密到polygarithminmic因子内。我们所有的结果均基于简单和直观的组合思想,这些想法简化了技术的状态。
Let $S$ be a string of length $n$ over an alphabet $Σ$ and let $Q$ be a subset of $Σ$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$ return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an $O(\sqrt{nq})$ space data structure that -- with some minor additions -- supports queries in $O(\log\log n)$ time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter $d$, giving a simple data structure that uses $O(d)$ space and supports queries in $O(\log\log n)$ time. The preprocessing algorithm does a single pass over $S$, runs in expected $O(n)$ time, and uses $O(d)$ space in addition to the input. Furthermore, we show that $O(d)$ space is optimal and that $O(\log\log n)$-time queries are optimal given optimal space. Secondly, we bound $d = O(\sqrt{nq})$, giving clean bounds in terms of $n$ and $q$ that match the state of the art. Furthermore, we prove that $Ω(\sqrt{nq})$ bits of space is necessary in the worst case, meaning that the $O(\sqrt{nq})$ upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.