Paper Title
Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics
Paper Authors
Paper Abstract
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward while satisfying minimum requirements on secondary cumulative reward constraints. Several algorithms rooted in sample-based primal-dual methods have recently been proposed to solve this problem in policy space. However, such methods are based on stochastic gradient descent-ascent algorithms whose trajectories are connected to the optimal policy only after a mixing output stage that depends on the algorithm's history. As a result, there is a mismatch between the behavioral policy and the optimal one. In this work, we propose a novel algorithm for constrained RL that does not suffer from these limitations. Leveraging recent results on regularized saddle-flow dynamics, we develop a novel stochastic gradient descent-ascent algorithm whose trajectories converge to the optimal policy almost surely.
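For context, the sample-based primal-dual baseline the abstract alludes to can be sketched as a generic stochastic gradient descent-ascent loop on the Lagrangian of the constrained problem max_theta J_r(theta) subject to J_g(theta) >= b. The sketch below is only an illustration of that generic scheme, not the paper's dissipative saddle-flow algorithm; all function names (sampled_grad_reward, sampled_constraint_value, sampled_grad_constraint), the toy gradient estimators, and the step sizes are hypothetical placeholders.

```python
# Illustrative sketch (NOT the paper's algorithm): generic stochastic
# primal-dual gradient descent-ascent for a constrained RL problem
#   max_theta  J_r(theta)   s.t.   J_g(theta) >= b,
# where J_r is the expected cumulative reward and J_g the expected
# secondary cumulative reward. The estimators below are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sampled_grad_reward(theta):
    # Hypothetical noisy policy-gradient estimate of grad J_r(theta).
    return -2.0 * theta + rng.normal(scale=0.1, size=theta.shape)

def sampled_constraint_value(theta):
    # Hypothetical noisy estimate of J_g(theta).
    return float(np.sum(theta)) + rng.normal(scale=0.1)

def sampled_grad_constraint(theta):
    # Hypothetical noisy estimate of grad J_g(theta).
    return np.ones_like(theta) + rng.normal(scale=0.1, size=theta.shape)

b = 0.5               # minimum requirement on the secondary cumulative reward
eta = 1e-2            # primal and dual step size
theta = np.zeros(3)   # policy parameters (primal variable)
lam = 0.0             # Lagrange multiplier (dual variable), kept nonnegative

for _ in range(2000):
    # Lagrangian L(theta, lam) = J_r(theta) + lam * (J_g(theta) - b):
    # ascend in theta, descend in lam, projecting lam onto [0, inf).
    grad_theta = sampled_grad_reward(theta) + lam * sampled_grad_constraint(theta)
    theta = theta + eta * grad_theta
    lam = max(0.0, lam - eta * (sampled_constraint_value(theta) - b))

print("policy parameters:", theta, "multiplier:", lam)
```

As the abstract notes, the iterates of such a loop are tied to the optimal policy only through an averaging/mixing output stage over the algorithm's history, which is precisely the behavioral-versus-optimal-policy mismatch the proposed dissipative saddle-flow approach is designed to remove.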