论文标题
截短的Linucb用于随机线性匪徒
Truncated LinUCB for Stochastic Linear Bandits
论文作者
论文摘要
本文考虑了具有有限数量的武器的上下文匪徒,其中上下文是独立的且分布的$ d $维随机向量,而预期的奖励在ARM参数和上下文中都是线性的。 Linucb算法对于相关线性匪徒而言是最佳的最低限度算法,该算法被证明具有累积的遗憾,由于其过度探索,在尺寸$ d $和Time Horizon $ t $中均具有次优。提出并称为“ tr-linucb”的截断版本,该版本遵循linucb到截断时间$ s $,然后进行纯剥削。如果$ s = cd \ log(t)$对于足够大的常数$ c $,并且建立了匹配的下限,则显示出Tr-Linucb算法可实现$ O(d \ log(t))$遗憾,这表明$ d $和$ t $在低维度下的$ t $中的速率优化性。此外,如果某些$κ> 1 $的$ s = d \ log^κ(t)$,与最佳相比的损失是乘法$ \ log \ log \ log \ log \ log(t)$ factor,这不取决于$ d $。对选择Tr-Linucb的截断时间过度造成的这种不敏感性是实际上的。
This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have a cumulative regret that is suboptimal in both the dimension $d$ and time horizon $T$, due to its over-exploration. A truncated version of LinUCB is proposed and termed "Tr-LinUCB", which follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. The Tr-LinUCB algorithm is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, which shows the rate optimality of Tr-LinUCB in both $d$ and $T$ under a low dimensional regime. Further, if $S = d\log^κ(T)$ for some $κ>1$, the loss compared to the optimal is a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting in choosing the truncation time of Tr-LinUCB is of practical importance.