Paper Title
Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while enjoying much greater simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (OOD) returns. This matters in practice, as we often expect the policy to outperform the offline dataset by conditioning on an OOD return value. We show that this unreliability arises from both the suboptimality of the training data and the model architecture. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights high-return trajectories to reduce the train-test gap for BC methods, while the conservative regularizer encourages the policy to stay close to the data distribution under OOD conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
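The trajectory-weighting idea described above can be illustrated with a minimal sketch. The abstract does not specify CWBC's exact weighting scheme, so the softmax-over-returns form and the `temperature` parameter below are purely illustrative assumptions:

```python
import numpy as np

def trajectory_weights(returns, temperature=1.0):
    """Upweight high-return trajectories via a softmax over returns.

    Illustrative sketch only: the actual CWBC weighting scheme is not
    given in the abstract, and `temperature` is a hypothetical knob
    controlling how strongly high returns are favored.
    """
    returns = np.asarray(returns, dtype=np.float64)
    # Shift by the max for numerical stability before exponentiating.
    z = (returns - returns.max()) / temperature
    w = np.exp(z)
    return w / w.sum()

# Sample training trajectories in proportion to these weights,
# so high-return trajectories appear more often during BC training.
rng = np.random.default_rng(0)
returns = [10.0, 50.0, 90.0]
probs = trajectory_weights(returns, temperature=20.0)
sampled = rng.choice(len(returns), size=1000, p=probs)
```

A lower temperature concentrates sampling on the best trajectories (shrinking the train-test gap when conditioning on high returns), while a higher temperature keeps the sampling closer to uniform.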