论文标题
编辑距离下的字符串消毒:改进和广泛
String Sanitization Under Edit Distance: Improved and Generalized
论文作者
论文摘要
让$ w $是字母$σ$,$ k $的长度$ n $的字符串,为正整数,而$ \ mathcal {s} $是$ w $的一组长度 - $ k $ substrings。 ETFS问题要求我们构造一个字符串$ x _ {\ mathrm {ed}} $,这样:(i)$ \ mathcal {s} $ no string of $ x _ {\ mathrm {ed}} $; (ii)$ w $和$ x _ {\ mathrm {ed}} $在$σ$上的所有其他长度-K $ substring的顺序 - 频率)相同; (iii)$ x _ {\ mathrm {ed}} $具有最小的编辑距离至$ w $。当$ w $代表个人的数据,而$ \ Mathcal {s} $代表一组机密模式时,ETFS问题要求转换$ W $以保护其隐私和效用[Bernardini等,ECML PKDD 2019]。 可以用$ \ Mathcal {o}(n^2k)$时间来解决ETF [Bernardini等,CPM 2020]。同一篇论文表明,对于任何$δ> 0 $,除非强的指数时间假设(SETH)是错误的,否则无法用$ \ Mathcal {o}(n^{2-δ})$时间求解ETF。我们的主要结果可以总结如下:(i)$ \ MATHCAL {o}(n^2 \ log^2k)$ time算法以求解ETFS; (ii)$ \ MATHCAL {o}(n^2 \ log^2n)$ - 时间算法求解AETFS,这是ETF的概括,其中$ \ Mathcal {S} $的元素可以具有任意长度。因此,除非塞思失败,否则我们的算法是多毛体因子的最佳选择。除了弦式消毒之外,我们的技术还可以激发解决与正则表达式或无上下文语法有关的其他问题的解决方案。
Let $W$ be a string of length $n$ over an alphabet $Σ$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $Σ$ (and thus the frequency) is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\mathcal{O}(n^2k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-δ})$ time, for any $δ>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2\log^2k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2\log^2n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.