Paper Title
Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient
Paper Authors
Paper Abstract
We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma's Revenge.
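As a rough illustration of the hybrid idea described above, the sketch below alternates between Q-learning-style backups over the union of an offline dataset and a growing online buffer, and collecting fresh online data with the current (epsilon-)greedy policy. It assumes a small discrete environment with a Gymnasium-style API and a tabular Q table; the function names, hyperparameters, and the tabular representation are illustrative stand-ins, not the paper's actual Hy-Q implementation (which uses neural network function approximation).

```python
import random
import numpy as np

def collect_episode(env, Q, epsilon=0.1, max_steps=200):
    """Roll out the epsilon-greedy policy w.r.t. Q and return its transitions."""
    transitions = []
    state, _ = env.reset()
    for _ in range(max_steps):
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        transitions.append((state, action, reward, next_state, terminated))
        state = next_state
        if terminated or truncated:
            break
    return transitions

def run_hybrid_q(env, offline_data, n_states, n_actions,
                 iterations=100, gamma=0.99, lr=0.5):
    """Hybrid loop: fit Q on offline + online data, then gather more online data.

    offline_data is a list of (state, action, reward, next_state, done) tuples
    collected by some behavior policy; online_data grows as the agent interacts.
    """
    Q = np.zeros((n_states, n_actions))
    online_data = []
    for _ in range(iterations):
        # Bellman backups over the union of offline and online transitions.
        for s, a, r, s_next, done in offline_data + online_data:
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += lr * (target - Q[s, a])
        # Online interaction with the current greedy policy.
        online_data.extend(collect_episode(env, Q))
    return Q
```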