Paper Title


Discovering Reinforcement Learning Algorithms

Paper Authors

Junhyuk Oh, Matteo Hessel, Wojciech M. Czarnecki, Zhongwen Xu, Hado van Hasselt, Satinder Singh, David Silver

Paper Abstract


Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific challenge, it remains an open question whether it is feasible to discover alternatives to fundamental concepts of RL such as value functions and temporal-difference learning. This paper introduces a new meta-learning approach that discovers an entire update rule which includes both 'what to predict' (e.g. value functions) and 'how to learn from it' (e.g. bootstrapping) by interacting with a set of environments. The output of this method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore it discovers a bootstrapping mechanism to maintain and use its predictions. Surprisingly, when trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance. This shows the potential to discover general RL algorithms from data.
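The abstract describes LPG only at a high level: a meta-learned update rule that takes an agent's experience together with a learned prediction vector (the discovered alternative to a value function) and produces targets that tell the agent what to predict and how to update its policy. The sketch below is a minimal, purely illustrative rendering of that idea, not the paper's actual architecture; the names `lpg_step`, `y_dim`, and the feed-forward stand-in for the meta-network are assumptions introduced here for clarity.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper): y is the agent's
# learned prediction vector, the stand-in for a value function.
y_dim = 4
in_dim = 3 + 2 * y_dim        # reward, done flag, log pi(a|s), y(s), y(s')
hidden_dim = 16

rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden_dim, in_dim))    # meta-parameters
W_out = rng.normal(scale=0.1, size=(1 + y_dim, hidden_dim))

def lpg_step(reward, done, logp_a, y_s, y_next):
    """One step of a hypothetical learned update rule.

    Inputs summarise a single transition of the agent's experience.
    Outputs are a scalar target for the policy logit of the taken action
    (pi_hat, 'how to learn') and a vector target for the agent's
    prediction y (y_hat, 'what to predict').
    """
    x = np.concatenate([[reward, float(done), logp_a], y_s, y_next])
    h = np.tanh(W_h @ x)        # feed-forward stand-in for the meta-network
    out = W_out @ h
    pi_hat, y_hat = out[0], out[1:]
    return pi_hat, y_hat

# Usage: the agent would nudge its policy logits toward pi_hat and its
# predictions toward y_hat; meta-training adjusts W_h and W_out so that
# agents trained with this rule improve across a set of environments.
pi_hat, y_hat = lpg_step(reward=1.0, done=False, logp_a=np.log(0.25),
                         y_s=np.zeros(y_dim), y_next=np.zeros(y_dim))
```

In the actual method, the meta-network is trained by interacting with a distribution of (toy) environments, and the discovered rule is then applied unchanged to new domains such as Atari; the block above only fixes the interface of such a rule, under the stated assumptions.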
