Paper Title
Optimal Strategies for Graph-Structured Bandits
Paper Authors
Paper Abstract
We study a structured variant of the multi-armed bandit problem specified by a set of Bernoulli distributions $ν\!=\!(ν\_{a,b})\_{a\in\mathcal{A}, b\in\mathcal{B}}$ with means $(μ\_{a,b})\_{a\in\mathcal{A}, b\in\mathcal{B}}\!\in\![0,1]^{\mathcal{A}\times\mathcal{B}}$ and by a given weight matrix $ω\!=\!(ω\_{b,b'})\_{b,b'\in\mathcal{B}}$, where $\mathcal{A}$ is a finite set of arms and $\mathcal{B}$ is a finite set of users. The weight matrix $ω$ is such that for any two users $b,b'\!\in\!\mathcal{B}$, $\max\_{a\in\mathcal{A}}|μ\_{a,b}\!-\!μ\_{a,b'}|\!\leq\!ω\_{b,b'}$. This formulation is flexible enough to capture various situations, from highly structured scenarios ($ω\!\in\!\{0,1\}^{\mathcal{B}\times\mathcal{B}}$) to fully unstructured setups ($ω\!\equiv\!1$). We consider two scenarios, depending on whether the learner chooses only the actions to sample rewards from, or both the users and the actions. We first derive problem-dependent lower bounds on the regret for this generic graph structure; these bounds involve a structure-dependent linear programming problem. Second, we adapt the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura (2015) to this setting, and introduce the IMED-GS$^\star$ algorithm. Interestingly, IMED-GS$^\star$ does not require computing the solution of the linear programming problem more than about $\log(T)$ times after $T$ steps, while being provably asymptotically optimal. Also, unlike existing bandit strategies designed for other popular structures, IMED-GS$^\star$ does not resort to an explicit forced-exploration scheme and only makes use of local counts of empirical events. We finally provide numerical illustrations of our results that confirm the performance of IMED-GS$^\star$.
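To make the weight-matrix constraint concrete, below is a minimal NumPy sketch checking that a mean matrix $μ$ is compatible with a given $ω$; the function name `satisfies_graph_structure` and the toy instance are ours, not from the paper. It also illustrates the two extremes mentioned in the abstract: a clustered $ω\!\in\!\{0,1\}^{\mathcal{B}\times\mathcal{B}}$ and the unstructured $ω\!\equiv\!1$.

```python
import numpy as np

def satisfies_graph_structure(mu, omega):
    """Check that max_a |mu[a, b] - mu[a, b']| <= omega[b, b'] for all users b, b'.

    mu:    (num_arms, num_users) array of Bernoulli means mu_{a,b}
    omega: (num_users, num_users) weight matrix
    """
    # (num_arms, num_users, num_users) tensor of |mu_{a,b} - mu_{a,b'}|,
    # then maximize over the arm axis to get a (num_users, num_users) gap matrix.
    gaps = np.abs(mu[:, :, None] - mu[:, None, :]).max(axis=0)
    return bool(np.all(gaps <= omega))

# Toy instance: 2 arms, 3 users; users 0 and 1 have identical means.
mu = np.array([[0.2, 0.2, 0.9],
               [0.7, 0.7, 0.1]])
omega_clustered = np.array([[0., 0., 1.],
                            [0., 0., 1.],
                            [1., 1., 0.]])   # omega in {0,1}: users 0 and 1 share means
omega_unstructured = np.ones((3, 3))         # omega ≡ 1: no structure at all
assert satisfies_graph_structure(mu, omega_clustered)
assert satisfies_graph_structure(mu, omega_unstructured)
```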
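As a point of reference for the index rule the paper builds on, here is a sketch of the vanilla Bernoulli IMED index of Honda and Takemura (2015): pull the arm minimizing $N\_a\,\mathrm{kl}(\hat{μ}\_a, \hat{μ}^\star) + \log N\_a$. This is only the base rule that IMED-GS$^\star$ adapts; how IMED-GS$^\star$ exploits the weight matrix and the linear program is not specified in the abstract, so it is not reproduced here.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_indexes(counts, means):
    """IMED index of each arm: N_a * kl(mu_hat_a, mu_hat_star) + log N_a.

    The IMED rule pulls the arm with the SMALLEST index: empirically optimal
    arms get index log N_a, while under-sampled suboptimal arms get a small
    index and are explored, without any explicit forced-exploration scheme.
    """
    return counts * bernoulli_kl(means, means.max()) + np.log(counts)

# Single-user illustration: one initial pull per arm, then follow IMED.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
counts = np.ones_like(true_means)
sums = rng.binomial(1, true_means).astype(float)
for t in range(5000):
    a = int(np.argmin(imed_indexes(counts, sums / counts)))
    sums[a] += rng.binomial(1, true_means[a])
    counts[a] += 1
print(counts)  # the optimal arm (mean 0.7) should dominate the pull counts
```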