Paper Title

Safe Reinforcement Learning via Projection on a Safe Set: How to Achieve Optimality?

Authors

Sébastien Gros, Mario Zanon, Alberto Bemporad

Abstract


For all its successes, Reinforcement Learning (RL) still struggles to deliver formal guarantees on the closed-loop behavior of the learned policy. Among other things, guaranteeing the safety of RL with respect to safety-critical systems is a very active research topic. Some recent contributions propose to rely on projections of the inputs delivered by the learned policy into a safe set, ensuring that the system safety is never jeopardized. Unfortunately, it is unclear whether this operation can be performed without disrupting the learning process. This paper addresses this issue. The problem is analysed in the context of $Q$-learning and policy gradient techniques. We show that the projection approach is generally disruptive in the context of $Q$-learning though a simple alternative solves the issue, while simple corrections can be used in the context of policy gradient methods in order to ensure that the policy gradients are unbiased. The proposed results extend to safe projections based on robust MPC techniques.
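The central operation discussed in the abstract, projecting a possibly unsafe action produced by the learned policy onto a safe set before applying it to the system, can be illustrated as a small quadratic program. The sketch below is a minimal, hypothetical illustration only, assuming a polytopic safe set $\{a : A a \le b\}$; the matrices `A` and `b`, the helper `project_to_safe_set`, and the example action are placeholders and not the paper's implementation, which builds the safe set from robust MPC techniques.

```python
# Minimal sketch (assumed, not the paper's code): project a raw policy action
# onto a polytopic safe set {a : A a <= b} by solving
#   min_a_safe ||a_safe - a_raw||^2   s.t.   A a_safe <= b.
import numpy as np
from scipy.optimize import minimize

def project_to_safe_set(a_raw, A, b):
    """Return the closest action to a_raw inside {a : A a <= b}."""
    objective = lambda a: np.sum((a - a_raw) ** 2)
    # SLSQP expects inequality constraints in the form g(a) >= 0.
    constraints = [{"type": "ineq", "fun": lambda a: b - A @ a}]
    res = minimize(objective, x0=a_raw, method="SLSQP", constraints=constraints)
    return res.x

# Example: a 2-D action box |a_i| <= 1 written as A a <= b.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
a_unsafe = np.array([1.7, -0.3])            # raw output of the learned policy
a_safe = project_to_safe_set(a_unsafe, A, b)
print(a_safe)                                # approximately [1.0, -0.3]
```

As the paper argues, applying such a projection naively can bias the learning signal; the abstract's point is that $Q$-learning and policy gradient methods require different remedies to keep the learning process undisrupted.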
