Paper Title
Label-Only Membership Inference Attacks
Paper Authors
Paper Abstract
Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and a model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only has access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.
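To make the perturbation-robustness idea concrete, below is a minimal sketch of the data-augmentation variant of a label-only attack. It assumes a black-box predict_label interface that returns only the hard label; the augmentations (shifts and a flip), the helper names label_only_membership_score and augmentations, the dummy model, and the 0.9 threshold are illustrative assumptions, not the paper's exact construction or calibration procedure.

import numpy as np

def augmentations(x, shifts=(-2, -1, 1, 2)):
    # Simple pixel-shift augmentations of an image x with shape (H, W, C):
    # a horizontal flip plus small vertical/horizontal translations. These
    # stand in for the "common data augmentations" the abstract refers to.
    augmented = [np.flip(x, axis=1)]            # horizontal flip
    for s in shifts:
        augmented.append(np.roll(x, s, axis=0))  # vertical shift
        augmented.append(np.roll(x, s, axis=1))  # horizontal shift
    return augmented

def label_only_membership_score(predict_label, x, y):
    # Fraction of perturbed queries whose predicted hard label still matches y.
    # predict_label: hypothetical black-box function returning only the class label.
    # A higher score (more robust predictions) is taken as evidence that (x, y)
    # was a training member; a decision threshold would be calibrated, e.g. on
    # shadow models or known non-members.
    queries = [x] + augmentations(x)
    labels = [predict_label(q) for q in queries]
    return float(np.mean([label == y for label in labels]))

if __name__ == "__main__":
    # Illustration only: a dummy "model" that returns random labels.
    rng = np.random.default_rng(0)
    dummy_model = lambda q: int(rng.integers(0, 10))
    x = rng.random((32, 32, 3))
    score = label_only_membership_score(dummy_model, x, y=3)
    is_member = score >= 0.9  # hypothetical threshold
    print(f"membership score = {score:.2f}, infer member: {is_member}")

The same scoring interface could instead use adversarial perturbations, e.g. by estimating the distance to the model's decision boundary and treating larger distances as evidence of membership, which is the other perturbation family mentioned in the abstract.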