Paper Title
Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning
Paper Authors
Paper Abstract
We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to capture temporal dependencies in human assessment of trajectories. We show how RM can be approached as a multiple instance learning (MIL) problem, where trajectories are treated as bags with return labels, and steps within the trajectories are instances with unseen reward labels. We go on to develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and can be used to train high-performing agent policies.
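To make the bag/instance framing concrete, the following is a minimal sketch, not the paper's architecture: a trajectory (bag) of steps (instances) is encoded by an LSTM so that each step's predicted reward can depend on earlier steps, and the per-step predictions are summed and trained against the human-provided trajectory return. Only the bag label is supervised; per-step rewards are never observed. All module names, dimensions, and the toy training data below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMRewardModel(nn.Module):
    """Illustrative MIL reward model: maps a trajectory (bag) of steps
    (instances) to per-step rewards, summed to predict the bag's return."""

    def __init__(self, step_dim, hidden_dim=64):
        super().__init__()
        # Recurrent encoder: a step's reward may depend on earlier steps,
        # capturing non-Markovian (temporal) structure in the labels.
        self.encoder = nn.LSTM(step_dim, hidden_dim, batch_first=True)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, trajectories):
        # trajectories: (batch, timesteps, step_dim)
        hidden, _ = self.encoder(trajectories)
        step_rewards = self.reward_head(hidden).squeeze(-1)  # (batch, timesteps)
        returns = step_rewards.sum(dim=1)                     # (batch,)
        return step_rewards, returns


# Training sketch: supervision comes only from trajectory-level returns.
model = LSTMRewardModel(step_dim=8)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

trajs = torch.randn(32, 50, 8)   # toy batch: 32 trajectories of 50 steps
true_returns = torch.randn(32)   # human-provided trajectory (bag) labels

for _ in range(100):
    _, pred_returns = model(trajs)
    loss = nn.functional.mse_loss(pred_returns, true_returns)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

After training, the intermediate `step_rewards` can be read off as a reconstructed reward signal for downstream RL, which is the role the paper's MIL models play.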