Paper Title

Online Shielding for Reinforcement Learning

Paper Authors

Bettina Könighofer, Julian Rudolf, Alexander Palmisano, Martin Tappler, Roderick Bloem

Paper Abstract

Despite the recent impressive results in reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach for determining near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP is given together with a temporal logic safety specification, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability of not violating the safety specification within the next $k$ steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block the action from the agent. Existing offline shielding approaches exhaustively compute the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions, and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
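The shielding rule described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch only: the names (`toy_mdp`, `UNSAFE`, `max_safety_prob`, `shielded_actions`) are hypothetical and not from the paper, the MDP is a hand-made toy encoded as nested dicts, and the $k$-step safety probabilities are computed by naive recursion rather than by the probabilistic model checking the authors' tooling would use.

```python
def max_safety_prob(mdp, unsafe, state, k):
    """Maximal probability of not entering an unsafe state within the next k steps."""
    if state in unsafe:
        return 0.0
    if k == 0 or not mdp.get(state):  # horizon reached or terminal safe state
        return 1.0
    # Maximise over actions; each action maps to a list of (probability, successor).
    return max(
        sum(p * max_safety_prob(mdp, unsafe, nxt, k - 1) for p, nxt in outcomes)
        for outcomes in mdp[state].values()
    )

def shielded_actions(mdp, unsafe, state, k, threshold):
    """Actions whose k-step safety probability meets the given threshold."""
    return [
        action
        for action, outcomes in mdp[state].items()
        if sum(p * max_safety_prob(mdp, unsafe, nxt, k - 1)
               for p, nxt in outcomes) >= threshold
    ]

# Toy safety-relevant MDP fragment: action "b" risks reaching the unsafe state.
toy_mdp = {
    "s0":   {"a": [(1.0, "good")], "b": [(0.5, "good"), (0.5, "bad")]},
    "good": {"stay": [(1.0, "good")]},
    "bad":  {"stay": [(1.0, "bad")]},
}
UNSAFE = {"bad"}

print(shielded_actions(toy_mdp, UNSAFE, "s0", 3, 0.9))  # only "a" passes
```

With threshold 0.9 the shield blocks action "b" (its 3-step safety probability is only 0.5), while a looser threshold of 0.4 would allow both actions; in online shielding these values would be precomputed for the reachable states while the agent deliberates.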
