Paper Title
Slowly Changing Adversarial Bandit Algorithms are Efficient for Discounted MDPs
Paper Authors
Paper Abstract
Reinforcement learning generalizes multi-armed bandit problems with additional difficulties of a longer planning horizon and unknown transition kernel. We explore a black-box reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits, where, specifically, an independent bandit learner is placed in each state. We show that, under ergodicity and fast mixing assumptions, any slowly changing adversarial bandit algorithm achieving optimal regret in the adversarial bandit setting can also attain optimal expected regret in infinite-horizon discounted Markov decision processes, with respect to the number of rounds $T$. Furthermore, we examine our reduction using a specific instance of the exponential-weight algorithm.
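As a rough illustration of the reduction described above, the sketch below places an independent exponential-weight (Exp3-style) bandit learner in each state of a small synthetic discounted tabular MDP. The transition kernel, reward table, learning rate, and the use of the immediate reward as bandit feedback are illustrative assumptions for this sketch, not the paper's exact construction or guarantees.

```python
import numpy as np

# Sketch: one exponential-weight (Exp3-style) bandit learner per state of a
# small discounted tabular MDP. All quantities below (MDP, eta, feedback
# choice) are assumed for illustration only.

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
gamma = 0.9   # discount factor of the MDP
T = 5000      # number of rounds
eta = 0.05    # exponential-weight learning rate (assumed)

# Synthetic ergodic transition kernel P[s, a] -> distribution over next states,
# and bounded rewards R[s, a] in [0, 1].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

# One weight vector per state: the independent bandit learners.
weights = np.ones((n_states, n_actions))

s = 0
total_reward = 0.0
discounted_return = 0.0
for t in range(T):
    # The learner at the current state chooses an action from its weights.
    probs = weights[s] / weights[s].sum()
    a = rng.choice(n_actions, p=probs)

    r = R[s, a]
    total_reward += r
    discounted_return += (gamma ** t) * r

    # Exp3-style importance-weighted loss estimate; only the learner at
    # state s is updated this round, so its policy changes slowly overall.
    loss_est = np.zeros(n_actions)
    loss_est[a] = (1.0 - r) / probs[a]
    weights[s] *= np.exp(-eta * loss_est)

    # Transition to the next state under the unknown kernel.
    s = rng.choice(n_states, p=P[s, a])

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
print(f"discounted return from the start state: {discounted_return:.3f}")
```

In this toy setup, each state's learner only sees the rounds in which that state is visited, which is why the ergodicity and fast-mixing assumptions in the abstract matter: every learner must be updated often enough for the overall policy to compete with the optimal one.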