Paper Title
Learning to Constrain Policy Optimization with Virtual Trust Region
Paper Authors
Paper Abstract
We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.
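The abstract does not give the exact MCPO objective, so the following is only a minimal sketch of the general idea: a policy-gradient surrogate penalized by two KL terms, one toward the single old policy (the usual trust region) and one toward a "virtual" policy built from a memory of past policies. The PolicyNet architecture, the uniform mixture used to form the virtual policy, and the fixed coefficients beta_old and beta_virtual are assumptions for illustration; the paper's method instead learns how to weight the virtual trust region dynamically during optimization.

import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence

class PolicyNet(nn.Module):
    # Hypothetical small discrete-action policy used only for this sketch.
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def dist(self, obs):
        return Categorical(logits=self.net(obs))

def dual_trust_region_loss(policy, old_policy, memory, obs, actions, advantages,
                           beta_old=1.0, beta_virtual=1.0):
    """Surrogate loss with two KL penalties: one toward the old policy and one
    toward a virtual policy formed here as a uniform mixture over the action
    distributions of past policies stored in `memory` (an assumed construction)."""
    new_dist = policy.dist(obs)
    with torch.no_grad():
        old_dist = old_policy.dist(obs)
        # Virtual policy: average the action probabilities of remembered policies.
        probs = torch.stack([p.dist(obs).probs for p in memory]).mean(dim=0)
        virtual_dist = Categorical(probs=probs)

    # Standard importance-weighted policy-gradient surrogate.
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()

    kl_old = kl_divergence(old_dist, new_dist).mean()
    kl_virtual = kl_divergence(virtual_dist, new_dist).mean()

    # Maximize the surrogate while penalizing deviation from both trust regions.
    return -(surrogate - beta_old * kl_old - beta_virtual * kl_virtual)

In MCPO as described above, the relative strength of the two constraints is not fixed as in this sketch: the new policy is pulled more strongly toward the virtual policy when the old policy performs poorly, and the virtual policy itself is constructed automatically from the memory of past policies.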