Paper Title
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to the function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness that can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly-expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and we add a term maximizing action-values into the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show the expressiveness of the diffusion model-based policy, and the coupling of the behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
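To make the training objective described above concrete, the following is a minimal PyTorch sketch of a Diffusion-QL-style loss: a DDPM behavior-cloning term on dataset actions plus a term that maximizes the learned action-value of actions sampled from the diffusion policy. The class and function names (`NoisePredictor`, `QNetwork`, `diffusion_ql_loss`, `sample_action`), the number of diffusion steps `T`, the noise schedule, and the weight `eta` are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 5  # number of diffusion steps (assumed small, as is common for diffusion policies)
betas = torch.linspace(1e-4, 0.1, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action, conditioned on the state and the step t."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )
    def forward(self, state, noisy_action, t):
        t_emb = t.float().unsqueeze(-1) / T  # simple scalar timestep embedding (assumption)
        return self.net(torch.cat([state, noisy_action, t_emb], dim=-1))

class QNetwork(nn.Module):
    """State-action value function Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Mish(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def sample_action(model, state, action_dim):
    """Reverse diffusion: denoise Gaussian noise into an action, conditioned on the state."""
    a = torch.randn(state.shape[0], action_dim)
    for i in reversed(range(T)):
        t = torch.full((state.shape[0],), i, dtype=torch.long)
        eps = model(state, a, t)
        # DDPM posterior mean, plus noise on all but the final step
        a = (a - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            a = a + torch.sqrt(betas[i]) * torch.randn_like(a)
    return a.clamp(-1.0, 1.0)

def diffusion_ql_loss(model, q_net, state, action, eta=1.0):
    """Behavior-cloning diffusion loss plus a Q-value maximization term."""
    # 1) Behavior cloning: predict the noise used to corrupt the dataset action.
    t = torch.randint(0, T, (state.shape[0],))
    noise = torch.randn_like(action)
    ab = alpha_bars[t].unsqueeze(-1)
    noisy_action = torch.sqrt(ab) * action + torch.sqrt(1 - ab) * noise
    bc_loss = F.mse_loss(model(state, noisy_action, t), noise)
    # 2) Policy improvement: sample an action from the diffusion policy and
    #    push it toward high Q-values (gradients flow through the sampler).
    new_action = sample_action(model, state, action.shape[-1])
    q_loss = -q_net(state, new_action).mean()
    return bc_loss + eta * q_loss

# Usage on a random batch (stands in for a minibatch from an offline dataset):
STATE_DIM, ACTION_DIM = 17, 6
policy, q_net = NoisePredictor(STATE_DIM, ACTION_DIM), QNetwork(STATE_DIM, ACTION_DIM)
s, a = torch.randn(32, STATE_DIM), torch.rand(32, ACTION_DIM) * 2 - 1
loss = diffusion_ql_loss(policy, q_net, s, a)
loss.backward()
```

The single scalar `eta` reflects the coupling the abstract describes: the same loss keeps sampled actions close to the behavior policy (the denoising term) while steering them toward higher action-values (the Q term).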