Paper Title
PerD: Perturbation Sensitivity-based Neural Trojan Detection Framework on NLP Applications
Paper Authors
Paper Abstract
Deep Neural Networks (DNNs) have been shown to be susceptible to Trojan attacks. A Neural Trojan is a type of targeted poisoning attack that embeds a backdoor into the victim model and is activated by a trigger in the input space. The increasing deployment of DNNs in critical systems, together with the surge of outsourced DNN training (which makes Trojan attacks easier), makes the detection of Trojan attacks necessary. While Neural Trojan detection has been studied in the image domain, there is a lack of solutions in the NLP domain. In this paper, we propose a model-level Trojan detection framework that analyzes the deviation of the model's output when a specially crafted perturbation is introduced to the input. In particular, we extract the model's responses to perturbed inputs as the "signature" of the model and train a meta-classifier to determine whether a model is Trojaned based on its signature. We demonstrate the effectiveness of our proposed method both on a dataset of NLP models we create and on a public dataset of Trojaned NLP models from TrojAI. Furthermore, we propose a lightweight variant of our detection method that reduces the detection time while preserving the detection rates.
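The signature-and-meta-classifier pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy models, the perturbation (a shift along one input dimension standing in for a crafted token-level perturbation), the `signature` definition (per-probe output deviation), and the hand-rolled logistic-regression meta-classifier are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_model(trojaned):
    # Toy stand-in for an NLP classifier returning a sigmoid score.
    # A "Trojaned" model is hypersensitive to feature 0 (the "trigger").
    w = rng.normal(size=8)
    def model(x):
        logits = x @ w
        if trojaned:
            logits = logits + 3.0 * x[..., 0]  # backdoor sensitivity (assumed)
        return 1.0 / (1.0 + np.exp(-logits))
    return model

def perturb(x):
    # Crafted perturbation: shift the trigger-carrying dimension.
    x = x.copy()
    x[..., 0] += 1.0
    return x

def signature(model, probes):
    # Signature = deviation of outputs on clean vs. perturbed probe inputs.
    return model(probes) - model(perturb(probes))

# Build a labeled set of clean (0) and Trojaned (1) models.
probes = rng.normal(size=(16, 8))
models = [make_model(t) for t in [0] * 20 + [1] * 20]
labels = np.array([0] * 20 + [1] * 20)
X = np.stack([signature(m, probes) for m in models])

# Meta-classifier: logistic regression trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - labels
    w -= 0.1 * (X.T @ g) / len(labels)
    b -= 0.1 * g.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
acc = (preds == labels).mean()
```

Trojaned models react systematically to the perturbation (their output consistently shifts in one direction), so their signatures separate from those of clean models, which is what the meta-classifier learns to exploit.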