Paper Title

Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits

Authors

Haipeng Luo, Mengxiao Zhang, Peng Zhao, Zhi-Hua Zhou

Abstract

We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is poly$(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.
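For readers unfamiliar with the notion, the switching regret mentioned above can be written out explicitly. The following is a standard formulation for linear bandits with losses $\ell_t$ and plays $x_t$, not taken verbatim from the paper, with notation ($\ell_t$, $x_t$, $u_t$) assumed here for illustration:

```latex
% Switching regret against a comparator sequence u_1, ..., u_T
% drawn from the domain (e.g., the d-dimensional \ell_p unit ball):
\mathrm{Reg}(u_{1:T}) \;=\; \sum_{t=1}^{T} \langle \ell_t, x_t \rangle
  \;-\; \sum_{t=1}^{T} \langle \ell_t, u_t \rangle,
\qquad\text{where}\qquad
S \;=\; \sum_{t=2}^{T} \mathbb{1}\{u_t \neq u_{t-1}\}
```

counts the number of switches in the comparator sequence; the abstract's bound $\widetilde{O}(\sqrt{dST})$ controls this quantity uniformly over all sequences with at most $S$ switches.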
