Paper Title
Generating Electronic Health Records with Multiple Data Types and Constraints
Paper Authors
Paper Abstract
Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that these risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures, or vital signs), and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis of more than 770,000 EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints, and associated patterns from real data, without sacrificing privacy.