Paper Title
Optimal Control-Based Baseline for Guided Exploration in Policy Gradient Methods
Paper Authors
Paper Abstract
In this paper, a novel optimal control-based baseline function is presented for policy gradient methods in deep reinforcement learning (RL). The baseline is obtained by computing the value function of an optimal control problem that is formulated to be closely associated with the RL task. In contrast to traditional baselines, which aim to reduce the variance of policy gradient estimates, our work uses the optimal control value function to introduce a novel aspect of the baseline's role: providing guided exploration during policy learning. This aspect has received little attention in prior work. We validate our baseline on robot learning tasks, demonstrating its effectiveness for guided exploration, particularly in sparse-reward environments.
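For context, the standard policy gradient estimator with a state-dependent baseline takes the form below; the notation here is the conventional one and is our own, not quoted from the paper:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right],$$

where $G_t$ denotes the return from time $t$. The abstract indicates that the proposed method sets $b(s) = V^{\mathrm{oc}}(s)$, the value function of an optimal control problem constructed to mirror the RL task. Under this choice, only trajectories whose returns exceed the optimal-control reference value yield positive reinforcement, which would steer the policy toward behavior competitive with the control solution; this is one plausible reading of the guided-exploration claim, and it suggests why the effect would be most pronounced in sparse-reward settings, where an uninformed baseline provides no comparable signal.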