Paper Title
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective
Paper Authors
Paper Abstract
While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL that concern policy exploration or model guarantees, our bound applies directly to the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample efficiency of the best prior model-based and model-free RL methods. While sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.
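To make the kind of bound described in the abstract concrete, the following is a generic sketch under simplifying assumptions, not necessarily the exact derivation in the paper. Let p^π(τ) denote the trajectory distribution of policy π in the real environment, let q(τ) be the trajectory distribution induced by the learned latent-space model together with the policy, and let R(τ) ≥ 0 be the return. Importance weighting followed by Jensen's inequality gives a lower bound of the form

\log \mathbb{E}_{p^{\pi}(\tau)}[R(\tau)] = \log \mathbb{E}_{q(\tau)}\!\left[ R(\tau)\, \frac{p^{\pi}(\tau)}{q(\tau)} \right] \;\ge\; \mathbb{E}_{q(\tau)}[\log R(\tau)] - D_{\mathrm{KL}}\big(q(\tau) \,\|\, p^{\pi}(\tau)\big).

Maximizing the first term encourages high returns, while minimizing the KL term enforces self-consistency between the model-induced and real trajectory distributions, so an objective of this shape can be optimized jointly over the representation, latent-space model, and policy.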