Model Based Meta Learning of Critics for Policy Gradients

Paper Title

Model Based Meta Learning of Critics for Policy Gradients

Authors

Sarah Bechtle, Ludovic Righetti, Franziska Meier

Abstract

Being able to seamlessly generalize across different tasks is fundamental for robots to act in our world. However, learning representations that generalize quickly to new scenarios is still an open research problem in reinforcement learning. In this paper, we present a framework to meta-learn the critic for gradient-based policy learning. Concretely, we propose a model-based bi-level optimization algorithm that updates the critic's parameters such that the policy learned with the updated critic gets closer to solving the meta-training tasks. We illustrate that our algorithm leads to learned critics that resemble the ground-truth Q function for a given task. Finally, after meta-training, the learned critic can be used to learn new policies for new, unseen task and environment settings via model-free policy gradient optimization, without requiring a model. We present results that show the generalization capabilities of our learned critic to new tasks and dynamics when it is used to learn a new policy in a new scenario.
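
The bi-level structure described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the one-step task, the quadratic critic `Q_phi(a) = -(a - phi)^2`, the scalar policy `a = theta`, the hand-derived gradients, and every name and constant (`S0`, `GOAL`, `ALPHA`, `BETA`, `model`, `meta_train`, `learn_policy_model_free`) are assumptions chosen only to make the inner and outer updates visible.

```python
# Toy one-step problem; all names and numbers are illustrative assumptions.
S0, GOAL = 0.0, 3.0      # start state and target state
ALPHA, BETA = 0.25, 0.5  # inner (policy) and outer (critic) step sizes


def model(state, action):
    """Known dynamics model, used only during meta-training."""
    return state + action


def critic_grad(action, phi):
    """Derivative with respect to the action of Q_phi(a) = -(a - phi)^2."""
    return -2.0 * (action - phi)


def inner_update(theta, phi):
    """One policy-gradient ascent step on the critic; the policy outputs a = theta."""
    return theta + ALPHA * critic_grad(theta, phi)


def meta_train(theta=0.0, phi=0.0, steps=200):
    """Outer loop: adjust the critic so the updated policy solves the task."""
    for _ in range(steps):
        theta_new = inner_update(theta, phi)
        # Task error of the updated policy, evaluated through the model.
        error = model(S0, theta_new) - GOAL
        # Chain rule for the task loss L = error^2:
        # dL/dphi = dL/dtheta_new * dtheta_new/dphi = (2 * error) * (2 * ALPHA).
        phi -= BETA * (2.0 * error) * (2.0 * ALPHA)
        theta = theta_new
    return phi


def learn_policy_model_free(phi, theta=-5.0, steps=200):
    """After meta-training: train a fresh policy from the critic alone."""
    for _ in range(steps):
        theta = inner_update(theta, phi)
    return theta


phi = meta_train()
theta = learn_policy_model_free(phi)
```

In this toy setting the critic's peak `phi` moves toward `GOAL - S0`, so the learned critic coincides with the negative task cost, and a policy initialized elsewhere reaches the goal action without ever calling `model`. Real tasks would need multi-step rollouts, function approximators, and automatic differentiation in place of the hand-written chain rule.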
