Paper Title

Mirror Descent Policy Optimization

Paper Authors

Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh

Abstract

Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called {\em mirror descent policy optimization} (MDPO). MDPO iteratively updates the policy by {\em approximately} solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive {\em on-policy} and {\em off-policy} variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact {\em not} a necessity for high performance gains in TRPO. We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO. Overall, MDPO is derived from the MD principles, offers a unified approach to viewing a number of popular RL algorithms, and performs better than or on-par with TRPO, PPO, and SAC in a number of continuous control tasks. Code is available at \url{https://github.com/manantomar/Mirror-Descent-Policy-Optimization}.
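To make the update described in the abstract concrete, here is a minimal sketch of one on-policy MDPO iteration: multiple gradient steps on an objective built from the importance-weighted (linearized) advantage and a KL proximity term to the previous policy. This is an illustration under stated assumptions, not the authors' implementation: it assumes a discrete action space and a PyTorch policy module whose `dist(states)` method returns a `torch.distributions.Categorical`; names such as `mdpo_policy_update`, `tk`, and `num_sgd_steps` are hypothetical.

```python
# Minimal sketch of an on-policy MDPO policy update (illustrative only).
# Assumes: `policy` and `old_policy` are torch.nn.Module objects with a
# `dist(states)` method returning a Categorical; `advantages` are estimates
# of A^{pi_old}(s, a). These names are assumptions, not the paper's code.
import torch

def mdpo_policy_update(policy, old_policy, states, actions, advantages,
                       tk=1.0, num_sgd_steps=10, lr=3e-4):
    """Approximately solve the MDPO trust-region problem by taking several
    gradient steps on: linearized RL objective - (1/tk) * KL(new || old)."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    with torch.no_grad():
        old_dist = old_policy.dist(states)          # pi_{theta_k}(.|s), frozen
        old_log_probs = old_dist.log_prob(actions)

    for _ in range(num_sgd_steps):
        new_dist = policy.dist(states)              # pi_theta(.|s)
        ratio = torch.exp(new_dist.log_prob(actions) - old_log_probs)
        # Linearization of the RL objective: importance-weighted advantages.
        surrogate = (ratio * advantages).mean()
        # Proximity term keeping consecutive policies close to each other.
        kl = torch.distributions.kl_divergence(new_dist, old_dist).mean()
        loss = -(surrogate - kl / tk)               # gradient ascent on psi

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this sketch the KL term is a soft penalty with coefficient 1/tk rather than an explicit constraint, which mirrors the abstract's point that enforcing a hard trust-region constraint is not required for strong performance.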
