Paper Title

Provable Defense against Backdoor Policies in Reinforcement Learning

Paper Authors

Shubham Kumar Bharti, Xuezhou Zhang, Adish Singla, Xiaojin Zhu

Paper Abstract

We propose a provable defense mechanism against backdoor policies in reinforcement learning under the subspace trigger assumption. A backdoor policy is a security threat in which an adversary publishes a seemingly well-behaved policy that in fact contains hidden triggers. During deployment, the adversary can modify observed states in a particular way to trigger unexpected actions and harm the agent. We assume the agent does not have the resources to re-train a good policy. Instead, our defense mechanism sanitizes the backdoor policy by projecting observed states onto a 'safe subspace', estimated from a small number of interactions with a clean (non-triggered) environment. Our sanitized policy achieves $\epsilon$-approximate optimality in the presence of triggers, provided the number of clean interactions is $O\left(\frac{D}{(1-\gamma)^4 \epsilon^2}\right)$, where $\gamma$ is the discount factor and $D$ is the dimension of the state space. Empirically, we show that our sanitization defense performs well on two Atari game environments.
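To make the sanitization step concrete, the following is a minimal sketch of the projection-based defense described in the abstract. It assumes the safe subspace is estimated from a matrix of clean state observations via a truncated SVD; the function names (`estimate_safe_subspace`, `sanitize`), the rank parameter `r`, and the policy handle `pi` are illustrative placeholders, not identifiers from the paper.

```python
import numpy as np

def estimate_safe_subspace(clean_states: np.ndarray, r: int) -> np.ndarray:
    """Estimate an orthonormal basis of the 'safe subspace' from clean
    (non-triggered) state observations via a top-r SVD.

    clean_states: array of shape (n, D), one observed state per row.
    Returns U of shape (D, r) whose columns span the estimated subspace.
    """
    # The top-r right singular vectors capture the directions in which
    # clean states actually vary; trigger perturbations are assumed to
    # live (mostly) outside this subspace.
    _, _, vt = np.linalg.svd(clean_states, full_matrices=False)
    return vt[:r].T  # shape (D, r)

def sanitize(state: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Project an observed (possibly triggered) state onto the safe
    subspace before passing it to the fixed backdoor policy."""
    return U @ (U.T @ state)

# Illustrative usage with a fixed, possibly backdoored policy `pi` (hypothetical):
#   U = estimate_safe_subspace(clean_states, r=64)
#   action = pi(sanitize(observed_state, U))
```

The sketch simply assumes enough clean states have been collected to estimate the subspace; the abstract's $O\left(\frac{D}{(1-\gamma)^4 \epsilon^2}\right)$ bound quantifies how many clean interactions suffice for $\epsilon$-approximate optimality.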
