Paper Title
Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
Paper Authors
Paper Abstract
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of $\Omega((H^3 S A/\varepsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP. This lower bound relies on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3 S A T})$ regret lower bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
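For readability, the two bounds stated in the abstract can be written out with the symbols made explicit. The conventions below are standard but assumed here rather than taken from the abstract: $H$ is the horizon, $S$ the number of states, $A$ the number of actions, $K$ the number of episodes, $T = KH$ the total number of samples, and $\tau$ the stopping time (number of episodes used) of a PAC algorithm.

```latex
% Sample complexity: any (\varepsilon,\delta)-PAC algorithm for best policy
% identification in a non-stationary episodic MDP must, on some MDP instance,
% use a number of episodes satisfying
\[
  \mathbb{E}[\tau] \;=\; \Omega\!\left(\frac{H^{3} S A}{\varepsilon^{2}} \log\frac{1}{\delta}\right).
\]

% Regret: after T = KH interaction steps, any learning algorithm suffers,
% on some MDP instance, expected regret at least
\[
  \mathbb{E}\big[\mathrm{Regret}(T)\big] \;=\; \Omega\!\left(\sqrt{H^{3} S A T}\right).
\]
```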