论文标题
学习网络中的马尔可夫路径顺序
Learning the Markov order of paths in a network
论文作者
论文摘要
我们研究了以分类序列学习马尔可夫顺序的问题,该序列代表网络中的路径,即可变长度的序列,其中状态之间的过渡被限制为已知图。这些数据对标准马尔可夫订单检测方法和需求建模技术提出了挑战,这些技术明确说明了图形约束。我们开发了一种贝叶斯学习技术,采用多阶建模框架,与基于可能性比率测试的竞争方法相比,(i)更可靠地检测到正确的马尔可夫顺序,(ii)与使用AIC或BIC和(III)的方法相比,需要较少的数据,并且(III)是强大的,而(iii)是强有力的违反对层面约束的知识的强大知识。我们进一步表明,最近发表的使用可能性比测试的方法具有过度拟合真正的马尔可夫路径秩序的趋势,而对于我们的贝叶斯技术来说,这并非如此。我们的方法对于数据科学家对分类序列数据分析模式的数据很重要,这些模式受(部分)已知的约束,例如使用禁止单词,移动性轨迹和单击流数据或生物信息学中的序列数据的序列。在应对模型选择的主要挑战时,我们的工作与不断增长的研究体系进一步相关,该研究强调网络分析中对高阶模型的需求。
We study the problem of learning the Markov order in categorical sequences that represent paths in a network, i.e. sequences of variable lengths where transitions between states are constrained to a known graph. Such data pose challenges for standard Markov order detection methods and demand modelling techniques that explicitly account for the graph constraint. Adopting a multi-order modelling framework for paths, we develop a Bayesian learning technique that (i) more reliably detects the correct Markov order compared to a competing method based on the likelihood ratio test, (ii) requires considerably less data compared to methods using AIC or BIC, and (iii) is robust against partial knowledge of the underlying constraints. We further show that a recently published method that uses a likelihood ratio test has a tendency to overfit the true Markov order of paths, which is not the case for our Bayesian technique. Our method is important for data scientists analyzing patterns in categorical sequence data that are subject to (partially) known constraints, e.g. sequences with forbidden words, mobility trajectories and click stream data, or sequence data in bioinformatics. Addressing the key challenge of model selection, our work is further relevant for the growing body of research that emphasizes the need for higher-order models in network analysis.