Paper Title
Is Pessimism Provably Efficient for Offline RL?
Paper Authors
Paper Abstract
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.
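The abstract describes the pessimistic update only at a high level. Below is a minimal sketch of what pessimistic value iteration could look like when specialized to linear MDPs, using a ridge-regression estimate of the Bellman target and an elliptical uncertainty quantifier that is subtracted as a penalty rather than added as a bonus. The function names (`pevi_linear_mdp`, `phi`), the dataset layout, and the choice of the penalty scale `beta` are illustrative assumptions for this sketch, not code or constants from the paper.

```python
import numpy as np

def pevi_linear_mdp(dataset, phi, actions, H, beta, lam=1.0):
    """Sketch of pessimistic value iteration (PEVI) for a linear MDP.

    dataset: list of trajectories; traj[h-1] = (x_h, a_h, r_h, x_{h+1})
    phi:     feature map phi(x, a) -> np.ndarray of dimension d
    actions: finite action set to maximize over
    H:       horizon; beta: penalty scale; lam: ridge parameter
    Returns a greedy policy pi(h, x) -> action.
    """
    d = phi(dataset[0][0][0], actions[0]).shape[0]
    V_next = lambda x: 0.0                      # \hat V_{H+1} = 0
    Q_hat = [None] * (H + 1)

    for h in reversed(range(1, H + 1)):
        # Ridge regression of the Bellman target r_h + \hat V_{h+1}(x_{h+1})
        Lam = lam * np.eye(d)
        b = np.zeros(d)
        for traj in dataset:
            x, a, r, x_next = traj[h - 1]
            f = phi(x, a)
            Lam += np.outer(f, f)
            b += f * (r + V_next(x_next))
        Lam_inv = np.linalg.inv(Lam)
        w = Lam_inv @ b

        def Q(x, a, w=w, Lam_inv=Lam_inv, h=h):
            f = phi(x, a)
            # Uncertainty quantifier: large where the dataset covers (x, a) poorly.
            bonus = beta * np.sqrt(f @ Lam_inv @ f)
            # Pessimism: subtract the bonus instead of adding it, then truncate.
            return float(np.clip(f @ w - bonus, 0.0, H - h + 1))

        Q_hat[h] = Q
        V_next = lambda x, Q=Q: max(Q(x, a) for a in actions)

    def policy(h, x):
        return max(actions, key=lambda a: Q_hat[h](x, a))
    return policy
```

The only change relative to optimism-based value iteration in online RL is the sign of the uncertainty term, which is what makes the approach easy to implement and compatible with other function approximators: the penalty discourages the learned policy from exploiting state-action pairs that the dataset covers poorly.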