Paper Title
Safe Policy Improvement in Constrained Markov Decision Processes
Paper Authors
Paper Abstract
The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and to guarantee that the policy improvement does not increase the number of safety-requirement violations, which is especially important for safety-critical applications. In this work, we present a solution to the synthesis problem by addressing its two main challenges: reward shaping from a set of formal requirements and safe policy updates. For the former, we propose an automatic reward-shaping procedure that defines a scalar reward signal compliant with the task specification. For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion, with high-confidence guarantees. We also discuss the adoption of a model-based RL algorithm to use the collected data efficiently and to train a model-free agent on the predicted trajectories, where safety violations do not have the same impact as in the real world. Finally, we demonstrate on standard control benchmarks that the resulting learning procedure is effective and robust even under heavy perturbations of the hyperparameters.
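To make the two challenges named in the abstract more concrete, the following is a minimal, hypothetical Python sketch, not the paper's algorithm. It illustrates (1) collapsing target, safety, and comfort requirement scores into a single scalar reward and (2) gating a policy update on a high-confidence improvement test. All names (shaped_reward, safe_to_update, the requirement weights, and the Hoeffding-style confidence bound) are illustrative assumptions introduced here.

```python
# Hypothetical illustration of reward shaping and a high-confidence safe-update gate.
# This is an assumed sketch, not the procedure proposed in the paper.
import math
from typing import Callable, Sequence


def shaped_reward(state, action,
                  target: Callable, safety: Callable, comfort: Callable,
                  w_target: float = 1.0, w_safety: float = 10.0,
                  w_comfort: float = 0.1) -> float:
    """Combine per-requirement satisfaction scores (each assumed in [0, 1]) into
    one scalar reward. Safety is weighted most heavily so that violations
    dominate the signal (the weights here are arbitrary placeholders)."""
    return (w_target * target(state, action)
            - w_safety * (1.0 - safety(state, action))
            + w_comfort * comfort(state, action))


def safe_to_update(candidate_returns: Sequence[float],
                   baseline_return: float,
                   delta: float = 0.05,
                   return_range: float = 1.0) -> bool:
    """Accept the candidate policy only if a (1 - delta) lower confidence bound
    on its estimated return is at least the baseline policy's return.

    Uses a Hoeffding-style bound, assuming returns lie in an interval of width
    `return_range`; this stands in for whatever guarantee the paper derives."""
    n = len(candidate_returns)
    if n == 0:
        return False
    mean = sum(candidate_returns) / n
    lower_bound = mean - return_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return lower_bound >= baseline_return
```

In such a scheme, the candidate returns could come from rollouts of a learned dynamics model (as the abstract suggests), so that policies violating the confidence test are rejected before ever acting in the real environment.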