共发生问题的复杂性

论文标题

共发生问题的复杂性

The Complexity of the Co-Occurrence Problem

论文作者

Bille, Philip, Gørtz, Inge Li, Stordalen, Tord

论文摘要

让$ s $是字母$σ$上的长度$ n $的字符串，让$ q $是$ q $ q \ geq 2 $的$σ$的子集。 “共发生问题”是构建一个支持以下查询的紧凑数据结构：给定整数$ w $返回$ w $ s $的长度 - $ w $ s $的$ s $，其中包含$ q $的每个字符至少一次。这是一个自然的弦问题，应用程序，例如数据挖掘，自然语言处理和DNA分析。艺术的状态是$ o（\ sqrt {nq}）$空间数据结构，该结构（随着一些次要添加）支持$ o（\ log \ log \ log n）$ time中的查询[CPM 2021]。我们的贡献如下。首先，我们根据新的自然参数$ d $分析了问题，提供了一个简单的数据结构，该数据结构使用$ o（d）$ space并支持$ o（\ log \ log \ log n）$ time中的查询。预处理算法在$ s $上进行一次通行证，以预期的$ O（n）$时间运行，并在输入中使用$ o（d）$ space。此外，我们表明$ o（d）$ space是最佳的，并且$ o（\ log \ log n）$ - 时间查询是最佳给定最佳空间的最佳时间。其次，我们绑定了$ d = o（\ sqrt {nq}）$，根据$ n $和$ n $和$ q $的干净界限，与最新状态相匹配。此外，我们证明$ω（\ sqrt {nq}）$位是最坏的空间，这意味着$ o（\ sqrt {nq}）$上限紧密到polygarithminmic因子内。我们所有的结果均基于简单和直观的组合思想，这些想法简化了技术的状态。

Let $S$ be a string of length $n$ over an alphabet $Σ$ and let $Q$ be a subset of $Σ$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$ return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an $O(\sqrt{nq})$ space data structure that -- with some minor additions -- supports queries in $O(\log\log n)$ time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter $d$, giving a simple data structure that uses $O(d)$ space and supports queries in $O(\log\log n)$ time. The preprocessing algorithm does a single pass over $S$, runs in expected $O(n)$ time, and uses $O(d)$ space in addition to the input. Furthermore, we show that $O(d)$ space is optimal and that $O(\log\log n)$-time queries are optimal given optimal space. Secondly, we bound $d = O(\sqrt{nq})$, giving clean bounds in terms of $n$ and $q$ that match the state of the art. Furthermore, we prove that $Ω(\sqrt{nq})$ bits of space is necessary in the worst case, meaning that the $O(\sqrt{nq})$ upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.

下载PDF全文

下载文献需遵守相关版权规定

论文标题