Paper Title
Active Preference-Based Gaussian Process Regression for Reward Learning
Paper Authors
Paper Abstract
Designing reward functions is a challenging problem in AI and robotics. Humans usually have a difficult time directly specifying all the desirable behaviors that a robot needs to optimize. One common approach is to learn reward functions from collected expert demonstrations. However, learning reward functions from demonstrations introduces many challenges: some methods require highly structured models, e.g., reward functions that are linear in some predefined set of features, while others adopt less structured reward functions that, on the other hand, require a tremendous amount of data. In addition, humans tend to have a difficult time providing demonstrations on robots with high degrees of freedom, or even quantifying reward values for given demonstrations. To address these challenges, we present a preference-based learning approach, where, as an alternative, the human feedback is only in the form of comparisons between trajectories. Furthermore, we do not assume highly constrained structures on the reward function. Instead, we model the reward function using a Gaussian Process (GP) and propose a mathematical formulation to actively find a GP using only human preferences. Our approach enables us to tackle both the inflexibility and data-inefficiency problems within a preference-based learning framework. Our results in simulations and a user study suggest that our approach can efficiently learn expressive reward functions for robotics tasks.
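To make the idea concrete, below is a minimal sketch (not the authors' implementation) of preference-based reward learning with a GP prior. Trajectories are summarized as feature vectors, preferences over pairs are modeled with a Bradley-Terry likelihood on latent reward values, the GP posterior is approximated by a MAP estimate, and the paper's information-gain query selection is replaced here by a simple "closest to a coin flip" uncertainty heuristic. All function names, hyperparameters, and the simulated data are illustrative assumptions.

```python
# Minimal sketch of preference-based GP reward learning (assumed setup, not
# the paper's exact formulation): RBF kernel, Bradley-Terry preference
# likelihood, MAP approximation, and a simple uncertainty-based query rule.
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix over trajectory features X (n x d)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)


def map_reward(X, prefs, lengthscale=1.0, variance=1.0, jitter=1e-6):
    """MAP estimate of latent rewards f ~ GP(0, K) given preference pairs.

    prefs is a list of (i, j) meaning trajectory i was preferred over j;
    the likelihood is Bradley-Terry: p(i > j) = sigmoid(f_i - f_j).
    """
    n = X.shape[0]
    K = rbf_kernel(X, lengthscale, variance) + jitter * np.eye(n)
    K_inv = np.linalg.inv(K)

    def neg_log_posterior(f):
        # -log sigmoid(f_i - f_j) summed over preferences, plus the GP prior term
        nll = sum(np.logaddexp(0.0, -(f[i] - f[j])) for i, j in prefs)
        return nll + 0.5 * f @ K_inv @ f

    res = minimize(neg_log_posterior, np.zeros(n), method="L-BFGS-B")
    return res.x


def most_uncertain_query(f_hat, candidate_pairs):
    """Pick the pair whose predicted preference is closest to 50/50.

    A simple stand-in for the information-gain criterion used for active querying.
    """
    probs = np.array([1.0 / (1.0 + np.exp(-(f_hat[i] - f_hat[j])))
                      for i, j in candidate_pairs])
    return candidate_pairs[int(np.argmin(np.abs(probs - 0.5)))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))                   # 6 trajectories, 3 features each
    true_reward = X @ np.array([1.0, -0.5, 0.2])  # hidden reward to simulate answers
    asked = [(0, 1), (2, 3)]                      # pairs already shown to the "human"
    prefs = [(i, j) if true_reward[i] > true_reward[j] else (j, i) for i, j in asked]
    f_hat = map_reward(X, prefs)
    query = most_uncertain_query(f_hat, [(4, 5), (0, 2), (1, 3)])
    print("estimated rewards:", np.round(f_hat, 3))
    print("next pair to ask about:", query)
```

In this sketch the human is only ever asked "which of these two trajectories do you prefer?", matching the comparison-only feedback described in the abstract, while the GP prior supplies the flexible, non-linear reward structure and the query rule keeps the number of questions small.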