Paper Title
Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
Paper Authors
Paper Abstract
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of $\Omega((H^3 S A/\varepsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP. This lower bound relies on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3 S A T})$ regret lower bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
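For readability, the two bounds stated in the abstract can be written out with the symbols made explicit. The conventions below are standard but assumed here rather than taken from the abstract: $H$ is the horizon, $S$ the number of states, $A$ the number of actions, $K$ the number of episodes, $T = KH$ the total number of samples, and $\tau$ the stopping time (number of episodes used) of a PAC algorithm.

```latex
% Sample complexity: any (\varepsilon,\delta)-PAC algorithm for best policy
% identification in a non-stationary episodic MDP must, on some MDP instance,
% use a number of episodes satisfying
\[
  \mathbb{E}[\tau] \;=\; \Omega\!\left(\frac{H^{3} S A}{\varepsilon^{2}} \log\frac{1}{\delta}\right).
\]

% Regret: after T = KH interaction steps, any learning algorithm suffers,
% on some MDP instance, expected regret at least
\[
  \mathbb{E}\big[\mathrm{Regret}(T)\big] \;=\; \Omega\!\left(\sqrt{H^{3} S A T}\right).
\]
```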