Paper Title
Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning
Paper Authors
Paper Abstract
The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy $π$, which does not provide useful bounds for offline policy learning as $π$ will then be data-dependent. We address this problem by simultaneously evaluating all policies in a policy class $Π$ -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_mε^2)$ in identifying an $ε$-optimal policy under the time-inhomogeneous episodic MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $μ$). To the best of our knowledge, this is the first time the optimal rate has been shown to be achievable in the offline RL setting, and this paper is the first to systematically investigate uniform convergence in OPE.
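As a reading aid (not part of the original abstract), the following sketch shows how a uniform-convergence guarantee translates into an offline policy-learning bound; the notation $\widehat{V}^{\pi}$ for the estimated value, $\widehat{\pi}$ for the empirically best policy, and the assumption that an optimal policy $\pi^{\star}$ lies in $\Pi$ are ours, not quoted from the paper:
$$
V^{\pi^{\star}} - V^{\widehat{\pi}} \;\le\; 2\,\sup_{\pi \in \Pi}\bigl|\widehat{V}^{\pi} - V^{\pi}\bigr|,
\qquad \widehat{\pi} \in \arg\max_{\pi \in \Pi} \widehat{V}^{\pi}.
$$
Thus driving the supremum below $\epsilon/2$ simultaneously over all of $\Pi$ makes the empirically best policy $\epsilon$-optimal, which is how uniform convergence in OPE yields the $\widetilde{O}(H^3/d_m\epsilon^2)$ episode complexity stated above.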