Paper Title
Proximal Policy Gradient: PPO with Policy Gradient
Paper Authors
Paper Abstract
In this paper, we propose a new algorithm, PPG (Proximal Policy Gradient), which is close to both VPG (vanilla policy gradient) and PPO (proximal policy optimization). The PPG objective is a partial variation of the VPG objective, and the gradient of the PPG objective is exactly the same as the gradient of the VPG objective. To increase the number of policy update iterations, we introduce the advantage-policy plane and design a new clipping strategy. We perform experiments in OpenAI Gym and Bullet robotics environments with ten random seeds. The performance of PPG is comparable to that of PPO, and the entropy of PPG decays more slowly than that of PPO. Thus we show that performance similar to PPO can be obtained by using the gradient formula from the original policy gradient theorem.
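For context, a minimal sketch of the standard objects the abstract refers to, assuming the usual notation (policy \(\pi_\theta\), advantage estimate \(\hat{A}_t\), probability ratio \(r_t(\theta)\)); these are the textbook policy gradient theorem and the PPO clipped surrogate, not the paper's PPG objective or its new clipping strategy:

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],
\]
\[
L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
\]

According to the abstract, the PPG objective is constructed so that its gradient coincides with the VPG gradient above, while a new clipping strategy defined on the advantage-policy plane plays a role analogous to PPO's clip in allowing multiple policy update iterations per batch.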