Paper Title

Online Reinforcement Learning for Periodic MDP

Authors

Ayush Aniket, Arpan Chattopadhyay

Abstract

We study learning in periodic Markov decision processes (MDPs), a special type of non-stationary MDP in which both the state transition probabilities and the reward functions vary periodically, under the average-reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 grows linearly with the period and sublinearly with the horizon length. Numerical results demonstrate the efficacy of PUCRL2.
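
The state-augmentation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not cover PUCRL2 itself; the function name `augment_periodic_mdp` and the nested-list layout of `P` and `R` are assumptions made for illustration. The idea shown is that pairing each state with the period index, which advances deterministically modulo the period, yields a single time-invariant transition kernel.

```python
def augment_periodic_mdp(P, R):
    """Build a stationary MDP over augmented states (n, s).

    Assumed layout (hypothetical, for illustration only):
      P[n][s][a] is a list of next-state probabilities in period index n,
      R[n][s][a] is the mean reward in period index n.
    Returns (P_aug, R_aug) indexed by the flattened state n * S + s.
    """
    N = len(P)          # period length
    S = len(P[0])       # number of original states
    A = len(P[0][0])    # number of actions

    def idx(n, s):
        # Flatten (period index, state) into one augmented-state index.
        return n * S + s

    P_aug = [[[0.0] * (N * S) for _ in range(A)] for _ in range(N * S)]
    R_aug = [[0.0] * A for _ in range(N * S)]
    for n in range(N):
        nxt = (n + 1) % N   # the period index advances deterministically
        for s in range(S):
            for a in range(A):
                R_aug[idx(n, s)][a] = R[n][s][a]
                for s2 in range(S):
                    # Probability mass only reaches states tagged with index nxt.
                    P_aug[idx(n, s)][a][idx(nxt, s2)] = P[n][s][a][s2]
    return P_aug, R_aug
```

With period N and S original states, the augmented MDP has N * S states, which is one way to see why a dependence of the regret on the period is to be expected.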
