Paper Title
A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning
Paper Authors
Paper Abstract
Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are limited by being model-based or lacking worst-case theoretical guarantees beyond linear MDPs. This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees. We introduce novel proof techniques to show that, under suitable conditions, the worst-case regret of our posterior sampling method matches the best known results of optimization-based methods. In the linear MDP setting, the regret of our algorithm scales linearly with the dimension, compared to the quadratic dependence of existing posterior sampling-based exploration algorithms.
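For readers unfamiliar with the posterior sampling principle the abstract builds on, the following minimal sketch illustrates Thompson Sampling in the simplest bandit case. It is a hypothetical illustration, not the paper's model-free episodic RL algorithm: each arm keeps a Beta posterior over its Bernoulli mean reward, and at every round the learner samples one value per arm from the posterior and acts greedily on the samples.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Minimal Thompson Sampling sketch for a Bernoulli bandit.

    Keeps a Beta(alpha, beta) posterior per arm, samples one value per arm
    each round, and pulls the arm with the largest sampled mean.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)  # posterior successes + 1 (uniform Beta(1,1) prior)
    beta = np.ones(n_arms)   # posterior failures + 1
    total_reward = 0.0
    for _ in range(horizon):
        samples = rng.beta(alpha, beta)                 # one posterior sample per arm
        arm = int(np.argmax(samples))                   # act greedily on the samples
        reward = float(rng.random() < true_means[arm])  # observe Bernoulli reward
        alpha[arm] += reward                            # conjugate posterior update
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example with three arms of unknown means; play concentrates on the best arm.
print(thompson_sampling_bernoulli([0.2, 0.5, 0.7], horizon=2000))
```

The paper's contribution, per the abstract, is a model-free analogue of this sampling principle for general episodic MDPs with worst-case regret guarantees; the sketch above only conveys the bandit special case.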