Paper Title

Exploring Unknown States with Action Balance

Paper Authors

Yan Song, Yingfeng Chen, Yujing Hu, Changjie Fan

Paper Abstract

Exploration is a key problem in reinforcement learning. Recently, bonus-based methods have achieved considerable success in environments where exploration is difficult, such as Montezuma's Revenge; these methods assign additional bonuses (e.g., intrinsic rewards) to guide the agent toward rarely visited states. Since the bonus is calculated according to the novelty of the next state after performing an action, we call such methods next-state bonus methods. However, next-state bonus methods force the agent to pay excessive attention to exploring known states while neglecting the discovery of unknown states, since the exploration is driven by next states that have already been visited, which may slow the pace of finding rewards in some environments. In this paper, we focus on improving the effectiveness of finding unknown states and propose action balance exploration, which balances the frequency of selecting each action at a given state and can be treated as an extension of the upper confidence bound (UCB) to deep reinforcement learning. Moreover, we propose action balance RND, which combines a next-state bonus method (random network distillation exploration, RND) with our action balance exploration to take advantage of both. Experiments on the grid world and Atari games demonstrate that action balance exploration has a better capability of finding unknown states and can improve the performance of RND in some hard-exploration environments, respectively.
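Although the paper implements these ideas with neural networks, the core of action balance exploration is the UCB intuition the abstract names: at each state, boost actions that have been selected less often. Below is a minimal tabular sketch of that idea in Python; the class name `ActionBalanceBonus`, the coefficient `c`, the mixing weights `beta_ab` and `beta_rnd`, and the way the bonus is combined with an RND-style novelty score are illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import defaultdict


class ActionBalanceBonus:
    """Tabular sketch of a UCB-style action-balance bonus.

    Rewards actions that have rarely been selected in a given state,
    which pushes the agent toward successor states it has never seen.
    States must be hashable (e.g., (x, y) tuples in a grid world).
    """

    def __init__(self, num_actions: int, c: float = 1.0):
        self.c = c  # exploration coefficient (assumed hyperparameter)
        self.counts = defaultdict(lambda: [0] * num_actions)

    def bonus(self, state, action: int) -> float:
        counts = self.counts[state]
        total = sum(counts) + 1
        # UCB-style term: large when (state, action) has rarely been tried.
        return self.c * math.sqrt(math.log(total) / (counts[action] + 1))

    def update(self, state, action: int) -> None:
        self.counts[state][action] += 1


# One plausible reading of "action balance RND": mix the action-balance
# bonus with an RND-style next-state novelty bonus and the environment
# reward. beta_ab and beta_rnd are hypothetical mixing weights.
def total_reward(r_env, ab_bonus, rnd_bonus, beta_ab=0.5, beta_rnd=0.5):
    return r_env + beta_ab * ab_bonus + beta_rnd * rnd_bonus
```

The design point the abstract stresses is where each bonus attaches: the action-balance term depends only on the current state's action counts, so it rewards trying an action before its successor state has ever been visited, whereas the RND-style next-state term rewards reaching states that are still novel after the fact.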
