Paper Title

Proximal Policy Optimization Smoothed Algorithm

Paper Authors

Wangshu Zhu, Andre Rosendo

Paper Abstract

Proximal policy optimization (PPO) has yielded state-of-the-art results in policy search, a subfield of reinforcement learning; one of its key ideas is the use of a surrogate objective function to restrict the step size at each policy update. Although this restriction is helpful, the algorithm still suffers from performance instability and optimization inefficiency caused by the sudden flattening of the clipped objective. To address this issue, we present a PPO variant, named the Proximal Policy Optimization Smoothed Algorithm (PPOS), whose critical improvement is the use of a functional clipping method instead of a flat clipping method. We compare our method with PPO and PPO-RB, which adopts a rollback clipping method, and demonstrate that our method can conduct more accurate updates at each time step than other PPO methods. Moreover, we show that it outperforms the latest PPO variants in both performance and stability on challenging continuous control tasks.
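For context, PPO's clipped surrogate evaluates min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), where r_t is the probability ratio between the new and old policies and A_t is the advantage; outside the [1 - eps, 1 + eps] range the objective is flat and contributes no gradient, which is the "sudden flattening" the abstract refers to. The sketch below contrasts this flat clipping with a generic smooth ("functional") clipping curve. The tanh-shaped curve, the function names, and the epsilon value are illustrative assumptions for this sketch, not the exact form used by PPOS or PPO-RB.

```python
# Minimal sketch: flat (standard PPO) clipping vs. a smooth "functional"
# clipping of the probability ratio. The tanh curve below is an illustrative
# stand-in for a smooth clipping function, not the PPOS paper's exact formula.
import numpy as np

EPS = 0.2  # clipping range epsilon (common PPO default; assumption here)


def flat_clip_objective(ratio: np.ndarray, advantage: np.ndarray) -> np.ndarray:
    """Standard PPO per-sample surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - EPS, 1.0 + EPS)
    return np.minimum(ratio * advantage, clipped * advantage)


def smooth_clip_objective(ratio: np.ndarray, advantage: np.ndarray) -> np.ndarray:
    """Hypothetical smooth clipping: replace the hard clip with a tanh curve
    that saturates near 1 +/- eps, so the objective tapers off gradually
    instead of going exactly flat outside the trust region."""
    smooth = 1.0 + EPS * np.tanh((ratio - 1.0) / EPS)
    return np.minimum(ratio * advantage, smooth * advantage)


if __name__ == "__main__":
    r = np.linspace(0.5, 1.5, 11)   # probability ratios pi_new / pi_old
    adv = np.ones_like(r)           # positive advantage for illustration
    print("ratio    flat     smooth")
    for ri, f, s in zip(r, flat_clip_objective(r, adv), smooth_clip_objective(r, adv)):
        print(f"{ri:5.2f}  {f:7.3f}  {s:7.3f}")
```

Running the script prints the per-sample surrogate over a grid of ratios with a positive advantage: the flat version becomes constant beyond 1 + eps, while the smooth version keeps changing slightly, which is the qualitative difference between flat and functional clipping that the abstract describes.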
