论文标题

面向支持感知 CVaR 赌博机的最优汤普森采样策略

Optimal Thompson Sampling strategies for support-aware CVaR bandits

论文作者

Dorian Baudry, Romain Gautron, Emilie Kaufmann, Odalric-Ambrym Maillard

论文摘要

在本文中,我们研究了一个多臂赌博机问题,其中每个臂的质量由奖励分布在某一水平 alpha 下的条件风险价值(CVaR)来衡量。尽管该设定下的现有工作主要关注上置信界(UCB)类算法,我们为有界奖励的 CVaR 赌博机引入了一种新的汤普森采样方法,其足够灵活,能够解决各种基于物理资源的实际问题。在 Riou & Honda (2020) 近期工作的基础上,我们提出了面向连续有界奖励的 B-CVTS 和面向多项分布的 M-CVTS。在理论方面,我们对他们的分析作了非平凡的扩展,从而能够从理论上界定这两种策略的 CVaR 遗憾最小化性能。令人惊讶的是,我们的结果表明,这些策略是首个被证明在 CVaR 赌博机中达到渐近最优的策略,匹配了该设定下相应的渐近下界。此外,我们在模拟农业用例的现实环境以及多个合成示例上,实证展示了汤普森采样方法的优势。

In this paper we study a multi-arm bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables to theoretically bound their CVaR regret minimization performance. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use-case in agriculture and on various synthetic examples.
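为帮助理解摘要中的核心思路,下面给出一个极简的示意性草图(并非论文中 M-CVTS 的精确实现,仅为假设性演示):对每个臂在已知支撑上的观测计数维护 Dirichlet 后验,每一轮从后验中采样一个分布,按其 alpha 水平的 CVaR 打分,并选取得分最高的臂。其中函数名 `cvar`、`cvar_ts_step` 均为本示例自拟。

```python
import numpy as np

def cvar(values, probs, alpha):
    """Empirical CVaR at level alpha: mean of the worst alpha-fraction
    of outcomes (lower tail) of a discrete distribution."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cum = np.cumsum(p)
    # mass of each outcome that falls inside the lower alpha-tail
    tail = np.clip(alpha - (cum - p), 0.0, p)
    return float(np.dot(tail, v) / alpha)

def cvar_ts_step(counts, support, alpha, rng):
    """One Thompson-sampling round for multinomial arms on a shared support:
    draw a Dirichlet posterior sample per arm, score it by its CVaR,
    and play the arm with the highest sampled CVaR."""
    scores = []
    for c in counts:  # c: observation counts over the shared support
        theta = rng.dirichlet(np.asarray(c, dtype=float) + 1.0)  # Dirichlet(1) prior
        scores.append(cvar(support, theta, alpha))
    return int(np.argmax(scores))
```

例如,当 alpha = 1 时 CVaR 退化为期望,此时该草图即退化为普通的(均值目标)汤普森采样;alpha 越小,算法越偏向回避下尾风险大的臂。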
