Paper Title


The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Authors

Mohsen Bayati, Nima Hamidi, Ramesh Johari, Khashayar Khosravi

Abstract


We investigate a Bayesian $k$-armed bandit problem in the \emph{many-armed} regime, where $k \geq \sqrt{T}$ and $T$ denotes the time horizon. Initially, and in line with recent literature on many-armed bandit problems, we observe that subsampling plays a key role in designing optimal algorithms: the conventional UCB algorithm is sub-optimal, whereas a subsampled UCB (SS-UCB), which selects $\Theta(\sqrt{T})$ arms and runs UCB on them, achieves rate-optimality. However, despite SS-UCB's theoretical guarantee of optimal regret, it empirically underperforms a greedy algorithm that always pulls the empirically best arm. This observation extends to contextual settings through simulations on real-world data. Our findings suggest a new form of \emph{free exploration} that benefits greedy algorithms in the many-armed regime, fundamentally linked to a tail event of the prior distribution of arm rewards. This mechanism differs from the notion of free exploration driven by covariate variation discussed in the recent contextual bandit literature. Building on these insights, we establish that the subsampled greedy approach not only achieves rate-optimality for Bernoulli bandits in the many-armed regime but also attains sublinear regret for a broader class of reward distributions. Collectively, our results indicate that in the many-armed regime, practitioners may find greater value in adopting greedy algorithms.
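The comparison described above can be illustrated with a small simulation. The sketch below is not from the paper: it draws arm means i.i.d. uniform on $[0,1]$ (a hypothetical choice of prior), subsamples $\lceil\sqrt{T}\rceil$ arms as in SS-UCB, and then runs either the greedy index (empirical mean) or a standard UCB1-style index on the subsample; the function and variable names are illustrative, not the authors' code.

```python
import math
import random

def run_subsampled_bandit(T, k, index, seed=0):
    """Simulate a Bernoulli k-armed bandit on a subsample of ceil(sqrt(T)) arms.

    `index(wins, pulls, t)` scores an arm; the arm with the highest score is
    pulled. Returns the cumulative (pseudo-)regret against the best arm mean.
    """
    rng = random.Random(seed)
    means = [rng.random() for _ in range(k)]        # uniform prior on arm means
    m = min(math.ceil(math.sqrt(T)), k)             # subsample Theta(sqrt(T)) arms
    arms = rng.sample(range(k), m)
    pulls = {a: 0 for a in arms}
    wins = {a: 0 for a in arms}
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        untried = [a for a in arms if pulls[a] == 0]
        if untried:                                  # pull each subsampled arm once
            a = untried[0]
        else:                                        # then follow the index policy
            a = max(arms, key=lambda a: index(wins[a], pulls[a], t))
        reward = 1 if rng.random() < means[a] else 0
        pulls[a] += 1
        wins[a] += reward
        regret += best - means[a]
    return regret

# Greedy: empirical mean only. UCB1: empirical mean plus a confidence bonus.
greedy = lambda w, n, t: w / n
ucb1 = lambda w, n, t: w / n + math.sqrt(2 * math.log(t) / n)

T, k = 2000, 2000                                    # many-armed regime: k >= sqrt(T)
reg_greedy = run_subsampled_bandit(T, k, greedy)
reg_ucb = run_subsampled_bandit(T, k, ucb1)
```

On a single seed the outcome is noisy, so any ranking of the two regrets here is anecdotal; averaging over many seeds is what would reproduce the qualitative finding the abstract reports.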
