Paper Title

Gaussian Process Policy Optimization

Paper Authors

Ashish Rao, Bidipta Sarkar, Tejas Narayanan

Paper Abstract

We propose a novel actor-critic, model-free reinforcement learning algorithm which employs a Bayesian method of parameter space exploration to solve environments. A Gaussian process is used to learn the expected return of a policy given the policy's parameters. The system is trained by updating the parameters using gradient descent on a new surrogate loss function consisting of the Proximal Policy Optimization 'Clipped' loss function and a bonus term representing the expected improvement acquisition function given by the Gaussian process. This new method is shown to be empirically comparable to, and at times to outperform, current algorithms on environments that simulate robotic locomotion using the MuJoCo physics engine.
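
The abstract does not specify the authors' implementation, but the surrogate loss it describes can be sketched directly. The following is a minimal, hypothetical PyTorch sketch: a standard PPO clipped loss combined with an expected-improvement (EI) bonus computed from an exact GP regression posterior over flattened policy parameters. The RBF kernel, the `beta` bonus coefficient, the `lengthscale` and `noise` values, and the toy data are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the surrogate loss described in the abstract:
# PPO "clipped" loss + beta * expected improvement under a GP fit to
# (policy parameters -> observed return). Names and constants are assumed.
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def gp_posterior(theta, Theta, returns, lengthscale=1.0, noise=1e-3):
    """Exact GP regression posterior (RBF kernel) at one query point.
    Differentiable in `theta`, so the EI bonus admits gradient descent."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * lengthscale ** 2))
    K = rbf(Theta, Theta) + noise * torch.eye(len(Theta))
    k_star = rbf(theta.unsqueeze(0), Theta)            # shape (1, n)
    alpha = torch.linalg.solve(K, returns)             # K^{-1} y
    mean = (k_star @ alpha).squeeze()
    v = torch.linalg.solve(K, k_star.T)
    var = (1.0 - (k_star @ v)).squeeze().clamp_min(1e-9)
    return mean, var.sqrt()

def expected_improvement(theta, Theta, returns):
    """EI acquisition: E[max(f(theta) - best observed return, 0)] under the GP."""
    mean, std = gp_posterior(theta, Theta, returns)
    best = returns.max()
    z = (mean - best) / std
    normal = torch.distributions.Normal(0.0, 1.0)
    return (mean - best) * normal.cdf(z) + std * torch.exp(normal.log_prob(z))

# Toy usage: combine both terms into one surrogate loss (all data is dummy).
torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)    # flattened policy parameters
Theta = torch.randn(8, 4)                     # previously evaluated parameters
rets = torch.randn(8)                         # their observed returns
logp_old = torch.randn(32)
logp_new = logp_old + 0.01 * torch.randn(32)  # stand-in for new-policy log-probs
adv = torch.randn(32)

beta = 0.1                                    # EI bonus coefficient (assumed)
loss = ppo_clipped_loss(logp_new, logp_old, adv) \
       - beta * expected_improvement(theta, Theta, rets)
loss.backward()                               # gradients reach `theta` via EI
```

In the full algorithm the new-policy log-probabilities would themselves be functions of `theta`, so both terms contribute gradients; here `logp_new` is a stand-in tensor and only the EI term is differentiable in `theta`. The intended division of labor matches the abstract: the clipped term keeps each update close to the old policy, while the EI bonus pushes the parameters toward regions the Gaussian process predicts will improve on the best observed return.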
