Paper Title

STIR$^2$: Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks

Paper Authors

Jesus Bujalance Martin, Fabien Moutarde

Paper Abstract

In the search for more sample-efficient reinforcement-learning (RL) algorithms, a promising direction is to leverage as much external off-policy data as possible. For instance, expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage both demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes (via relabeling), encouraging expert imitation and self-imitation. Our experiments focus on several robotic-manipulation tasks across two different simulation environments. We show that our method based on reward relabeling improves the performance of the base algorithm (SAC and DDPG) on these tasks. Finally, our best algorithm STIR$^2$ (Self and Teacher Imitation by Reward Relabeling), which integrates into our method multiple improvements from previous works, is more data-efficient than all baselines.
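As a reading aid, here is a minimal Python sketch of the reward-relabeling idea described in the abstract, assuming a simple (state, action, reward, next_state, done) transition format and a generic off-policy replay buffer. The function `relabel_episode`, the `bonus` value, and the success test on the final sparse reward are illustrative assumptions, not the paper's exact bonus schedule.

```python
def relabel_episode(transitions, bonus=1.0, is_demo=False):
    """Sketch of reward relabeling for one episode.

    `transitions` is a list of (state, action, reward, next_state, done)
    tuples from a sparse-reward task. Demonstrations and episodes whose
    final transition carries a positive reward (i.e. successful episodes)
    get a reward bonus added before being stored in the replay buffer,
    encouraging expert imitation and self-imitation. The bonus value and
    the choice to relabel every transition are illustrative assumptions.
    """
    final_reward = transitions[-1][2]
    if not (is_demo or final_reward > 0):
        return transitions  # failed online episodes are stored unchanged

    return [(s, a, r + bonus, s_next, done)
            for (s, a, r, s_next, done) in transitions]


# Usage with any off-policy agent (e.g. SAC or DDPG) and a hypothetical
# `replay_buffer` exposing an `add(state, action, reward, next_state, done)`
# method:
#
#   for episode in demo_episodes:
#       for t in relabel_episode(episode, is_demo=True):
#           replay_buffer.add(*t)
#   for episode in online_episodes:
#       for t in relabel_episode(episode, is_demo=False):
#           replay_buffer.add(*t)
```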
