Paper Title
Confusing and Detecting ML Adversarial Attacks with Injected Attractors
Paper Authors
Paper Abstract
Many machine learning adversarial attacks find adversarial samples of a victim model ${\mathcal M}$ by following the gradient of some attack objective function, either explicitly or implicitly. To confuse and detect such attacks, we take a proactive approach that modifies those functions with the goal of misleading the attacks toward some local minima, or toward some designated regions that can easily be picked up by an analyzer. To achieve this goal, we propose adding a large number of artifacts, which we call $attractors$, onto the otherwise smooth function. An attractor is a point in the input space whose neighboring samples have gradients pointing toward it. We observe that the decoders of watermarking schemes exhibit the properties of attractors, and we give a generic method that injects attractors from a watermark decoder into the victim model ${\mathcal M}$. This principled approach allows us to leverage known watermarking schemes for scalability and robustness, and it provides explainability of the outcomes. Experimental studies show that our method has competitive performance. For instance, for un-targeted attacks on the CIFAR-10 dataset, we can reduce the overall attack success rate of DeepFool to 1.9%, whereas the known defenses LID, FS, and MagNet only reduce the rate to 90.8%, 98.5%, and 78.5%, respectively.
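A minimal way to formalize the attractor property sketched above (our own notation for illustration, not necessarily the paper's exact definition): let $g$ be the attack objective that a gradient-following attack descends, and let $\mathcal{N}(x^{\star})$ be a neighborhood of a point $x^{\star}$ in the input space. Then $x^{\star}$ acts as an attractor if

$$\bigl\langle -\nabla_{x}\, g(x),\; x^{\star} - x \bigr\rangle > 0 \quad \text{for all } x \in \mathcal{N}(x^{\star}) \setminus \{x^{\star}\},$$

so that a small gradient step $x \leftarrow x - \eta \nabla_{x} g(x)$ moves the attack iterate toward $x^{\star}$. Under this reading, densely injecting such points into ${\mathcal M}$'s loss landscape either traps the attack near an attractor or redirects it into designated regions that an analyzer can flag.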