Paper Title

Offline Reinforcement Learning via Linear-Programming with Error-Bound Induced Constraints

Paper Authors

Asuman Ozdaglar, Sarath Pattathil, Jiawei Zhang, Kaiqing Zhang

Paper Abstract

Offline reinforcement learning (RL) aims to find an optimal policy for Markov decision processes (MDPs) using a pre-collected dataset. In this work, we revisit the linear programming (LP) reformulation of Markov decision processes for offline RL, with the goal of developing algorithms with optimal $O(1/\sqrt{n})$ sample complexity, where $n$ is the sample size, under partial data coverage and general function approximation, and with favorable computational tractability. To this end, we derive new \emph{error bounds} for both the dual and primal-dual formulations of the LP, and incorporate them properly as \emph{constraints} in the LP reformulation. We then show that under a completeness-type assumption, $O(1/\sqrt{n})$ sample complexity can be achieved under the standard single-policy coverage assumption, when one properly \emph{relaxes} the occupancy validity constraint in the LP. This framework can readily handle both infinite-horizon discounted and average-reward MDPs, in both general function approximation and tabular cases. The instantiation to the tabular case achieves either state-of-the-art or the first sample complexities of offline RL in these settings. To further remove any completeness-type assumption, we then introduce a proper \emph{lower-bound constraint} in the LP, and a variant of the standard single-policy coverage assumption. Such an algorithm leads to an $O(1/\sqrt{n})$ sample complexity with dependence on the \emph{value-function gap}, with only realizability assumptions. Our properly constrained LP framework advances the existing results in several aspects, in relaxing certain assumptions and achieving the optimal $O(1/\sqrt{n})$ sample complexity, with simple analyses. We hope our results bring new insights into the use of LP formulations and the equivalent primal-dual minimax optimization for offline RL, through the error-bound induced constraints.
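For context, the LP reformulation referred to in the abstract is typically the standard dual LP over discounted state-action occupancy measures. A sketch in standard notation (not taken from the paper), with $\mu$ the occupancy measure, $\rho$ the initial-state distribution, $P$ the transition kernel, and $\gamma$ the discount factor:

```latex
\begin{align*}
\max_{\mu \ge 0} \quad & \sum_{s,a} \mu(s,a)\, r(s,a) \\
\text{s.t.} \quad & \sum_{a} \mu(s,a) \;=\; (1-\gamma)\,\rho(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, \mu(s',a') \qquad \text{for all } s .
\end{align*}
```

The equality constraints above are the "occupancy validity" constraints that the abstract proposes to relax, and an optimal policy can be recovered from an optimal $\mu$ via $\pi(a \mid s) \propto \mu(s,a)$.

As a purely illustrative toy, the sketch below solves a tabular version of this LP with the validity constraints relaxed to a two-sided slack $\varepsilon$, using scipy.optimize.linprog on a known model. This is not the paper's data-driven algorithm; the function name, the slack parameter `eps`, and all other identifiers are ours for illustration only.

```python
# Toy sketch (not the paper's algorithm): tabular occupancy-measure LP with
# the validity (flow) constraints relaxed by a slack eps, solved with SciPy.
import numpy as np
from scipy.optimize import linprog


def solve_relaxed_occupancy_lp(P, r, rho, gamma=0.99, eps=1e-2):
    """P: (S, A, S) transition tensor, r: (S, A) rewards, rho: (S,) initial dist."""
    S, A, _ = P.shape
    n = S * A  # decision variable mu(s, a), flattened row-major as (s, a)

    # Objective: maximize sum_{s,a} mu(s,a) r(s,a)  <=>  minimize -r^T mu.
    c = -r.reshape(n)

    # Flow matrix M with M[s', (s, a)] = 1{s' = s} - gamma * P(s' | s, a),
    # so the exact validity constraint reads M mu = (1 - gamma) * rho.
    M = np.zeros((S, n))
    for s in range(S):
        for a in range(A):
            col = s * A + a
            M[s, col] += 1.0
            M[:, col] -= gamma * P[s, a, :]
    b = (1.0 - gamma) * rho

    # Relax the equality to |M mu - b| <= eps via two one-sided inequalities.
    A_ub = np.vstack([M, -M])
    b_ub = np.concatenate([b + eps, -(b - eps)])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, None)] * n, method="highs")
    if not res.success:
        raise RuntimeError(f"LP solver failed: {res.message}")

    mu = res.x.reshape(S, A)
    # Recover a policy by normalizing the occupancy measure per state.
    policy = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)
    return mu, policy
```

With `eps = 0` this reduces to the exact occupancy-measure LP above; a positive `eps` mimics, in a very loose sense, the relaxation of the validity constraints discussed in the abstract, without any of the paper's error-bound machinery or data-driven estimation.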
