Paper Title
Exploration via Planning for Information about the Optimal Trajectory
Paper Authors
Paper Abstract
Many potential applications of reinforcement learning (RL) are stymied by the large numbers of samples required to learn an effective policy. This is especially true when applying RL to real-world control tasks, e.g. in the sciences or robotics, where executing a policy in the environment is costly. In popular RL algorithms, agents typically explore either by adding stochasticity to a reward-maximizing policy or by attempting to gather maximal information about environment dynamics without taking the given task into account. In this work, we develop a method that allows us to plan for exploration while taking both the task and the current knowledge about the dynamics into account. The key insight of our approach is to plan an action sequence that maximizes the expected information gain about the optimal trajectory for the task at hand. We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines and 200x fewer samples than model-free methods on a diverse set of low-to-medium dimensional control tasks in both the open-loop and closed-loop control settings.
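
To make the key idea in the abstract concrete, the following is a minimal Python sketch of task-aware exploration planning. It is not the paper's implementation: it uses a small ensemble of perturbed toy dynamics models as a stand-in for a posterior over dynamics, random shooting as the planner, and ensemble disagreement along rollouts as a crude proxy for the expected information gain about the optimal trajectory. All names, dimensions, the toy dynamics, and the reward function are hypothetical assumptions for illustration only.

# Minimal sketch (assumptions only, not the authors' method): score candidate
# action sequences by task return plus a disagreement-based proxy for the
# expected information gain about the optimal trajectory, then execute the best.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 2, 1
HORIZON, N_CANDIDATES, ENSEMBLE = 10, 64, 5

def true_dynamics(s, a):
    # Hypothetical toy dynamics: a damped point mass pushed by the action.
    return 0.9 * s + np.concatenate([a, 0.1 * a])

def reward(s):
    # Hypothetical task reward: drive the state to the origin.
    return -np.sum(s ** 2)

class EnsembleModel:
    """Stand-in for a posterior over dynamics: randomly perturbed copies."""
    def __init__(self):
        self.noise = [rng.normal(0.0, 0.05, size=(STATE_DIM,)) for _ in range(ENSEMBLE)]

    def predict(self, s, a, member):
        return true_dynamics(s, a) + self.noise[member]

def rollout(model, s0, actions, member):
    # Roll the action sequence forward under one ensemble member.
    s, states = s0, []
    for a in actions:
        s = model.predict(s, a, member)
        states.append(s)
    return np.array(states)

def info_gain_proxy(model, s0, actions):
    # Disagreement between ensemble rollouts: a cheap proxy for how much the
    # action sequence would tell us about the (task-relevant) optimal trajectory.
    trajs = np.stack([rollout(model, s0, actions, m) for m in range(ENSEMBLE)])
    return trajs.var(axis=0).sum()

def expected_return(model, s0, actions):
    # Mean task return over the ensemble, so planning stays task-aware.
    trajs = np.stack([rollout(model, s0, actions, m) for m in range(ENSEMBLE)])
    return np.mean([sum(reward(s) for s in traj) for traj in trajs])

def plan_exploration(model, s0, beta=1.0):
    # Random-shooting planner: pick the sequence trading off return and information.
    best_score, best_actions = -np.inf, None
    for _ in range(N_CANDIDATES):
        actions = rng.uniform(-1.0, 1.0, size=(HORIZON, ACTION_DIM))
        score = expected_return(model, s0, actions) + beta * info_gain_proxy(model, s0, actions)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions

if __name__ == "__main__":
    model = EnsembleModel()
    s0 = np.array([1.0, -0.5])
    actions = plan_exploration(model, s0)
    print("first planned action:", actions[0])

In this toy setup the beta parameter controls how strongly exploration is weighted against the task reward; the paper's approach instead plans directly for information about the optimal trajectory rather than using such a hand-tuned trade-off.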