Paper title
The (de)biasing effect of GAN-based augmentation methods on skin lesion images
Authors
Abstract
New medical datasets are now more open to the public, allowing for better and more extensive research. Although prepared with the utmost care, new datasets might still be a source of spurious correlations that affect the learning process. Moreover, data collections are usually not large enough and are often unbalanced. One approach to alleviate the data imbalance is using data augmentation with Generative Adversarial Networks (GANs) to extend the dataset with high-quality images. GANs are usually trained on the same biased datasets as the target data, resulting in more biased instances. This work explored unconditional and conditional GANs to compare their bias inheritance and how the synthetic data influenced the models. We provided extensive manual data annotation of possibly biasing artifacts on the well-known ISIC dataset with skin lesions. In addition, we examined classification models trained on both real and synthetic data with counterfactual bias explanations. Our experiments showed that GANs inherited biases and sometimes even amplified them, leading to even stronger spurious correlations. Manual data annotation and synthetic images are publicly available for reproducible scientific research.