Paper Title

Debiasing Meta-Gradient Reinforcement Learning by Learning the Outer Value Function

Authors

Clément Bonnet, Laurence Midgley, Alexandre Laterre

Abstract

Meta-gradient Reinforcement Learning (RL) allows agents to self-tune their hyper-parameters in an online fashion during training. In this paper, we identify a bias in the meta-gradient of current meta-gradient RL approaches. This bias comes from using the critic that is trained using the meta-learned discount factor for the advantage estimation in the outer objective which requires a different discount factor. Because the meta-learned discount factor is typically lower than the one used in the outer objective, the resulting bias can cause the meta-gradient to favor myopic policies. We propose a simple solution to this issue: we eliminate this bias by using an alternative, outer value function in the estimation of the outer loss. To obtain this outer value function we add a second head to the critic network and train it alongside the classic critic, using the outer loss discount factor. On an illustrative toy problem, we show that the bias can cause catastrophic failure of current meta-gradient RL approaches, and show that our proposed solution fixes it. We then apply our method to a more complex environment and demonstrate that fixing the meta-gradient bias can significantly improve performance.
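
The fix described in the abstract is architectural: the critic gets a second value head trained with the fixed outer-loss discount factor, and that head's values are used for the advantage estimate in the outer (meta) objective. Below is a minimal sketch of this idea, assuming a PyTorch setup; the class and function names, network sizes, discount values, and the simple return computation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a two-headed critic: one head for the meta-learned
# (inner) discount factor, one for the fixed outer-loss discount factor.
import torch
import torch.nn as nn


class TwoHeadCritic(nn.Module):
    """Critic with a shared torso and two value heads:
    - inner head: value under the meta-learned discount factor
    - outer head: value under the fixed outer-loss discount factor
    """

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.inner_head = nn.Linear(hidden_dim, 1)  # classic critic
        self.outer_head = nn.Linear(hidden_dim, 1)  # outer value function

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        return self.inner_head(h).squeeze(-1), self.outer_head(h).squeeze(-1)


def discounted_returns(rewards: torch.Tensor, bootstrap_value: float, gamma: float):
    """Discounted return targets for a single trajectory (toy helper)."""
    returns = []
    g = bootstrap_value
    for r in reversed(rewards.tolist()):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))


# Usage sketch: each head regresses onto returns computed with its own
# discount factor; the outer head then supplies the values for the advantage
# term of the outer loss, removing the discount mismatch.
critic = TwoHeadCritic(obs_dim=4)
obs = torch.randn(5, 4)      # toy trajectory of 5 observations
rewards = torch.randn(5)     # toy rewards
gamma_inner = 0.95           # stand-in for the meta-learned discount
gamma_outer = 0.99           # fixed discount of the outer objective

v_inner, v_outer = critic(obs)
target_inner = discounted_returns(rewards, bootstrap_value=0.0, gamma=gamma_inner)
target_outer = discounted_returns(rewards, bootstrap_value=0.0, gamma=gamma_outer)
critic_loss = ((v_inner - target_inner) ** 2).mean() + ((v_outer - target_outer) ** 2).mean()
```

In the meta-gradient update, the outer head's values (rather than the classic critic's) would feed the advantage estimate of the outer objective, which is the debiasing step the abstract describes.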
