Paper title
Synthesising Electronic Health Records: Cystic Fibrosis Patient Group
Paper authors
Paper abstract
Class imbalance can often degrade the predictive performance of supervised learning algorithms. Balanced classes can be obtained by oversampling the minority class with exact copies, with copies perturbed by noise, or with points interpolated between nearest neighbours (as in traditional SMOTE methods). Oversampling tabular data by augmentation, as is typical in computer vision tasks, can be achieved with deep generative models. Deep generative models are effective data synthesisers because they can capture complex underlying distributions. Synthetic data in healthcare can enhance interoperability between healthcare providers by ensuring patient privacy. Equipped with large synthetic datasets that represent small patient groups well, machine learning in healthcare can address the current challenges of bias and generalisability. This paper evaluates the ability of synthetic data generators to synthesise patient electronic health records. We test the utility of synthetic data for patient outcome classification and observe increased predictive performance when imbalanced datasets are augmented with synthetic data.
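For readers unfamiliar with the interpolation baseline mentioned in the abstract, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is an illustration only, not the paper's method and not a reference SMOTE implementation: the function name `smote_like_oversample`, its parameters (`n_new`, `k`, `seed`), and the use of Euclidean distance on purely numeric rows are all assumptions made for this example. The deep generative models the paper evaluates are not shown.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; assumes every row is a list of numbers.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority row as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Place the new row at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


# Toy values only (not patient data): four minority rows with two features each.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote_like_oversample(toy_minority, n_new=6, k=2, seed=42)
```

Each synthetic row lies on a straight line between two real minority rows, so this baseline cannot produce values outside the range already observed. That limitation is one reason to consider generative models that learn the underlying distribution instead.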