Paper Title
A Bayesian Approach to Linking Data Without Unique Identifiers
Paper Authors
Abstract
Existing file linkage methods may produce sub-optimal results because they consider neither the interactions between different pairs of matched records nor relationships between variables that are exclusive to one of the files. In addition, many of the current methods fail to address the uncertainty in the linkage, which may result in overly precise estimates of relationships between variables that are exclusive to one of the files. Bayesian methods for record linkage can reduce the bias in the estimation of scientific relationships of interest and provide interval estimates that account for the uncertainty in the linkage; however, implementation of these methods can often be complex and computationally intensive. This article presents the gfs_sampler package for the Python programming language that utilizes a Bayesian approach for file linkage. The linking procedure implemented in gfs_sampler samples from the joint posterior distribution of model parameters and the linking permutations. The algorithm approaches file linkage as a missing data problem and generates multiple linked data sets. For computational efficiency, only the linkage permutations are stored and multiple analyses are performed using each of the permutations separately. This implementation reduces the computational complexity of the linking process and the expertise required of researchers analyzing linked data sets. We describe the algorithm implemented in the gfs_sampler package and its statistical basis, and demonstrate its use on a sample data set.
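The workflow the abstract outlines (keep only the sampled linkage permutations, rerun the analysis once per permutation, then pool the results) can be sketched roughly as follows. This is a minimal illustration and not the gfs_sampler API: the function names `analyze_linked`, `combine`, and `perturb` are hypothetical, the "stored permutations" are a toy stand-in built by randomly swapping a few links in a known true permutation rather than draws from a posterior, and pooling with Rubin's multiple-imputation rules is an assumption suggested by the missing-data framing, not something the abstract states.

```python
import numpy as np


def analyze_linked(x, y_b, perm):
    """Link record i of file A to record perm[i] of file B, then fit
    y = a + b*x by least squares. Returns the slope and its variance."""
    y = y_b[perm]
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(x) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], cov[1, 1]


def combine(estimates, variances):
    """Pool per-permutation results with Rubin's rules (assumed here):
    total variance = within + (1 + 1/m) * between."""
    m = len(estimates)
    q_bar = float(np.mean(estimates))
    within = float(np.mean(variances))
    between = float(np.var(estimates, ddof=1))
    return q_bar, within + (1.0 + 1.0 / m) * between


def perturb(perm, n_swaps, rng):
    """Toy stand-in for a posterior draw: swap a few links at random."""
    p = perm.copy()
    for _ in range(n_swaps):
        i, j = rng.choice(len(p), size=2, replace=False)
        p[i], p[j] = p[j], p[i]
    return p


rng = np.random.default_rng(0)
n = 200

# File A holds x; file B holds y in a shuffled order with no shared identifier.
x = rng.normal(size=n)
true_perm = rng.permutation(n)
y_b = np.empty(n)
y_b[true_perm] = 2.0 * x + rng.normal(scale=0.5, size=n)

# Only the permutations are stored, not the full linked data sets.
stored_perms = [perturb(true_perm, 10, rng) for _ in range(20)]

# One analysis per stored permutation, pooled afterwards.
results = [analyze_linked(x, y_b, p) for p in stored_perms]
slope, total_var = combine([r[0] for r in results], [r[1] for r in results])
print(f"pooled slope: {slope:.3f}, pooled std. error: {total_var ** 0.5:.3f}")
```

Storing an integer permutation per draw rather than a full linked data set keeps memory proportional to the number of records, and the between-permutation variance term is what lets the pooled interval reflect linkage uncertainty instead of treating a single linkage as known.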