论文标题
依次学习具有可能性比分数的因果定向无环图的拓扑排序
Sequentially learning the topological ordering of causal directed acyclic graphs with likelihood ratio scores
论文作者
论文摘要
因果发现是数据挖掘场景中因果关系的学习,它具有强烈的科学和理论兴趣,作为确定“导致什么原因?”的起点。有时,有时可以识别并准确估计并准确估计有因果的无环图(DAG),而不是马尔可夫等效类别的图表,这给出了因果方向的歧义。本文的重点是通过一般的顺序排序过程突出显示具有一般错误分布的DAG的可识别性和估计,该过程一次订购一个变量,从一个根节点开始,然后是根节点的子女,依此类推,依此类推。我们证明了这种一般方法的新应用来估计DAG的拓扑排序。在过程的每个步骤中,仅在回归残差上计算出简单的似然比分数,以决定下一个节点以附加到当前的部分排序。我们算法在P节问题上的计算复杂性是O(PD),其中D是最大邻域大小。在温和的假设下,我们程序的人口版本证明了基础DAG的真实顺序。我们提供了广泛的数值证据,以证明此顺序程序尺度可能到可能数千个节点,并且可以很好地适用于高维数据。我们伴随着这些数值实验,并应用于单细胞基因表达数据集。
Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is in highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.