论文标题
使用不同的有条件gans生成不平衡数据时,改善相关捕获
Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs
论文作者
论文摘要
尽管生成的对抗网络(GAN)在文本,图像和视频上取得了显着的成功,但由于一些独特的挑战,例如在保留隐私的同时优化了合成患者数据的质量,因此仍在开发高质量的表格数据。在本文中,我们提出了DP-CGAN,这是一个私人条件的GAN框架,该框架由数据转换,采样,调理和网络培训组成,以生成现实且具有隐私性的表格数据。 DP-Cgans区分分类和连续变量,并将它们分别转化为潜在空间。然后,我们将条件矢量构建为附加输入,不仅在不平衡数据中介绍少数族裔类,还可以捕获变量之间的依赖性。我们将统计噪声注入DP-CGAN的网络训练过程中的梯度,以提供差异隐私保证。我们通过统计相似性,机器学习绩效和隐私测量值在三个公共数据集和两个现实世界中的个人健康数据集上使用最先进的生成模型广泛评估了我们的模型。我们证明,我们的模型优于其他可比模型,尤其是在捕获变量之间的依赖性时。最后,考虑到现实世界数据集的不同数据结构和特征,例如不平衡变量,异常分布和数据的稀疏性,我们介绍了数据实用性与隐私之间的平衡。
Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development owing to some unique challenges such as capturing dependencies in imbalanced data, optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and networks training to generate realistic and privacy-preserving tabular data. DP-CGANS distinguishes categorical and continuous variables and transforms them to latent space separately. Then, we structure a conditional vector as an additional input to not only presents the minority class in the imbalanced data, but also capture the dependency between variables. We inject statistical noise to the gradients in the networking training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model with state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing dependency between variables. Finally, we present the balance between data utility and privacy in synthetic data generation considering the different data structure and characteristics of real-world datasets such as imbalance variables, abnormal distributions, and sparsity of data.