Paper Title

Provable Defense against Backdoor Policies in Reinforcement Learning

Paper Authors

Shubham Kumar Bharti, Xuezhou Zhang, Adish Singla, Xiaojin Zhu

Paper Abstract

We propose a provable defense mechanism against backdoor policies in reinforcement learning under the subspace trigger assumption. A backdoor policy is a security threat in which an adversary publishes a seemingly well-behaved policy that in fact contains hidden triggers. During deployment, the adversary can modify observed states in a particular way to trigger unexpected actions and harm the agent. We assume the agent does not have the resources to re-train a good policy. Instead, our defense mechanism sanitizes the backdoor policy by projecting observed states onto a 'safe subspace', estimated from a small number of interactions with a clean (non-triggered) environment. Our sanitized policy achieves $\epsilon$-approximate optimality in the presence of triggers, provided the number of clean interactions is $O\left(\frac{D}{(1-\gamma)^4 \epsilon^2}\right)$, where $\gamma$ is the discount factor and $D$ is the dimension of the state space. Empirically, we show that our sanitization defense performs well on two Atari game environments.
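To make the sanitization step concrete, the following is a minimal sketch of the projection-based defense described in the abstract. It assumes the safe subspace is estimated from a matrix of clean state observations via a truncated SVD; the function names (`estimate_safe_subspace`, `sanitize`), the rank parameter `r`, and the policy handle `pi` are illustrative placeholders, not identifiers from the paper.

```python
import numpy as np

def estimate_safe_subspace(clean_states: np.ndarray, r: int) -> np.ndarray:
    """Estimate an orthonormal basis of the 'safe subspace' from clean
    (non-triggered) state observations via a top-r SVD.

    clean_states: array of shape (n, D), one observed state per row.
    Returns U of shape (D, r) whose columns span the estimated subspace.
    """
    # The top-r right singular vectors capture the directions in which
    # clean states actually vary; trigger perturbations are assumed to
    # live (mostly) outside this subspace.
    _, _, vt = np.linalg.svd(clean_states, full_matrices=False)
    return vt[:r].T  # shape (D, r)

def sanitize(state: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Project an observed (possibly triggered) state onto the safe
    subspace before passing it to the fixed backdoor policy."""
    return U @ (U.T @ state)

# Illustrative usage with a fixed, possibly backdoored policy `pi` (hypothetical):
#   U = estimate_safe_subspace(clean_states, r=64)
#   action = pi(sanitize(observed_state, U))
```

The sketch simply assumes enough clean states have been collected to estimate the subspace; the abstract's $O\left(\frac{D}{(1-\gamma)^4 \epsilon^2}\right)$ bound quantifies how many clean interactions suffice for $\epsilon$-approximate optimality.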
