Paper Title
Stochastic Bandits with Linear Constraints
Paper Authors
Paper Abstract
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward over the course of $T$ rounds is maximized, while each policy has an expected cost below a certain threshold $\tau$. We propose an upper-confidence bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB), and prove an $\widetilde{\mathcal{O}}(\frac{d\sqrt{T}}{\tau-c_0})$ bound on its $T$-round regret, where the denominator is the difference between the constraint threshold and the cost of a known feasible action. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting. We prove a regret bound of $\widetilde{\mathcal{O}}(\frac{\sqrt{KT}}{\tau - c_0})$ for this algorithm in $K$-armed bandits, which is a $\sqrt{K}$ improvement over the regret bound we obtain by simply casting multi-armed bandits as an instance of contextual linear bandits and using the regret bound of OPLB. We also prove a lower bound for the problem studied in the paper and provide simulations to validate our theoretical results.
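To make the "optimistic for reward, pessimistic for cost" idea in the abstract concrete, the sketch below shows one round of a selection rule in that spirit for linear bandits. This is an illustrative assumption about the mechanism, not the paper's exact OPLB algorithm: it builds least-squares estimates of the reward and cost parameters, inflates both by a confidence width (upper bound on reward, upper bound on cost), and plays the reward-optimistic arm among those whose pessimistic cost estimate still satisfies the threshold $\tau$. The function name `oplb_step` and all parameter names are hypothetical.

```python
import numpy as np

def oplb_step(A, b_r, b_c, arms, tau, beta):
    """One round of an optimistic-pessimistic arm selection (illustrative
    sketch only, not the paper's exact OPLB algorithm).

    A    : (d, d) regularized Gram matrix of observed features
    b_r  : (d,)   sum of feature * observed reward
    b_c  : (d,)   sum of feature * observed cost
    arms : (K, d) feature vectors of available actions
    tau  : cost threshold
    beta : confidence-width multiplier
    """
    A_inv = np.linalg.inv(A)
    theta_r = A_inv @ b_r              # least-squares reward estimate
    theta_c = A_inv @ b_c              # least-squares cost estimate
    # Elliptical confidence width ||x||_{A^{-1}} for each arm.
    widths = np.sqrt(np.einsum("kd,de,ke->k", arms, A_inv, arms))
    ucb_reward = arms @ theta_r + beta * widths  # optimism for reward
    ucb_cost = arms @ theta_c + beta * widths    # pessimism for cost
    feasible = ucb_cost <= tau                   # safely feasible arms only
    if not feasible.any():
        return None  # caller should fall back to the known feasible action
    idx = np.flatnonzero(feasible)
    return int(idx[np.argmax(ucb_reward[idx])])
```

With identity Gram matrix and two orthogonal arms, the rule picks the high-reward arm only when its pessimistic cost estimate clears the threshold; when no arm is safely feasible it returns `None`, reflecting the role of the known feasible action with cost $c_0$ in the regret bound.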