Paper Title
Causal Confusion and Reward Misidentification in Preference-Based Reward Learning
Paper Authors
Paper Abstract
Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we focus on a systematic study of causal confusion and reward misidentification when learning from preferences. In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states -- resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification. We also identify a set of methods with which to interpret misidentified learned rewards. In general, we observe that optimizing misidentified rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of preference learning to reward misidentification and causal confusion -- failure to consider even one of many factors can result in unexpected, undesirable behavior.
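The abstract assumes familiarity with how a reward model is fit to pairwise trajectory preferences. The sketch below shows the standard Bradley-Terry formulation commonly used for preference-based reward learning; the names (RewardNet, preference_loss), the network architecture, and the tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of preference-based reward learning with a Bradley-Terry model.
# Architecture and naming are assumptions for illustration only.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a single observation feature vector to a scalar predicted reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def preference_loss(reward_net: RewardNet,
                    traj_a: torch.Tensor,
                    traj_b: torch.Tensor,
                    pref_b: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: P(b preferred over a) is a sigmoid of the return gap.

    traj_a, traj_b: (batch, horizon, obs_dim) trajectory segments.
    pref_b: (batch,) labels, 1.0 if segment b is preferred, else 0.0.
    """
    ret_a = reward_net(traj_a).sum(dim=1)   # predicted return of segment a
    ret_b = reward_net(traj_b).sum(dim=1)   # predicted return of segment b
    logits = ret_b - ret_a                  # log-odds that b is preferred
    return nn.functional.binary_cross_entropy_with_logits(logits, pref_b)
```

In this setup, the reward network is trained on labeled segment pairs and a policy is then optimized against the learned reward; the failure mode studied in the paper arises when the learned reward latches onto non-causal features, so that this second optimization step drives the policy off the reward model's training distribution.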