Paper Title
Replicable Bandits
Paper Authors
Paper Abstract
In this paper, we introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called replicable if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do replicable policies exist, but they also achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the replicability parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms), we develop replicable policies that achieve the best-known problem-independent regret bounds with an optimal dependence on the replicability parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different rounds of executions.
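To make the notion of replicability concrete, the following is a minimal, hypothetical Python sketch (not the paper's algorithm): a toy explore-then-commit policy on a two-armed Bernoulli bandit that rounds empirical means to a coarse grid before committing, so that two independent executions, each with its own reward realizations, pull the exact same sequence of arms with high probability. The function name, arm means, and grid size are illustrative assumptions.

```python
import numpy as np

def explore_then_commit(means, horizon, explore_per_arm, grid=0.5, seed=None):
    """Toy replicable-style policy on a Bernoulli bandit (illustration only).

    Each execution sees fresh reward realizations, but empirical means are
    rounded to a coarse grid before the commitment step, so with high
    probability both executions commit to the same arm and thus pull the
    exact same arm sequence. This is a sketch of the concept, not the
    algorithm proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    pulls, rewards, counts = [], np.zeros(k), np.zeros(k)
    # Exploration phase: pull each arm a fixed number of times in a fixed order.
    for _ in range(explore_per_arm):
        for arm in range(k):
            pulls.append(arm)
            rewards[arm] += float(rng.random() < means[arm])
            counts[arm] += 1
    # Round empirical means to a coarse grid to suppress run-to-run fluctuations.
    rounded = np.round((rewards / counts) / grid) * grid
    best = int(np.argmax(rounded))
    # Commitment phase: pull the (rounded-)best arm for the rest of the horizon.
    pulls.extend([best] * (horizon - len(pulls)))
    return pulls

# Two independent executions, i.e., independent reward realizations.
run1 = explore_then_commit([0.9, 0.4], horizon=1000, explore_per_arm=50, seed=1)
run2 = explore_then_commit([0.9, 0.4], horizon=1000, explore_per_arm=50, seed=2)
print("identical arm sequences:", run1 == run2)  # True with high probability
```

In this toy example the coarse rounding plays the role that careful shared-randomness and discretization arguments play in the paper: it makes the policy's decisions insensitive to the sampling noise of any single execution, at the cost of a small amount of additional regret.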