Paper Title

Adversarial Dueling Bandits

Paper Authors

Saha, Aadirupa, Koren, Tomer, Mansour, Yishay

Paper Abstract

We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors).
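As a minimal illustration of the Borda score the abstract refers to, the sketch below computes, for each item $i$, its Borda score $b(i) = \frac{1}{K-1}\sum_{j \neq i} P(i, j)$ from a preference matrix $P$, where $P(i, j)$ is the probability that item $i$ beats item $j$ in a duel. The function name and the example matrix are my own choices for illustration, not from the paper.

```python
import numpy as np

def borda_scores(P):
    """Borda score of item i: the average probability that i beats a
    uniformly random *other* item, b(i) = (1/(K-1)) * sum_{j != i} P[i, j]."""
    K = P.shape[0]
    # Sum each row, then subtract the diagonal entry P[i, i] (an item
    # never duels itself), and normalize by the K-1 opponents.
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

# Hypothetical preference matrix: P[i, j] = Pr[item i beats item j],
# with P[i, j] + P[j, i] = 1 and P[i, i] = 1/2 by convention.
P = np.array([
    [0.5, 0.9, 0.8],
    [0.1, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])
scores = borda_scores(P)          # array([0.85, 0.35, 0.3])
winner = int(np.argmax(scores))   # Borda-winner: item 0
```

In the adversarial setting of the paper, a different preference matrix $P_t$ may be chosen at each round, and regret is measured against the best fixed item under the cumulative Borda scores.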
