Paper Title

Adversarial Dueling Bandits

Paper Authors

Saha, Aadirupa, Koren, Tomer, Mansour, Yishay

Paper Abstract

We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler \emph{fixed-gap} adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors).
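As a minimal illustration of the Borda score the abstract refers to, the sketch below computes, for each item $i$, its Borda score $b(i) = \frac{1}{K-1}\sum_{j \neq i} P(i, j)$ from a preference matrix $P$, where $P(i, j)$ is the probability that item $i$ beats item $j$ in a duel. The function name and the example matrix are my own choices for illustration, not from the paper.

```python
import numpy as np

def borda_scores(P):
    """Borda score of item i: the average probability that i beats a
    uniformly random *other* item, b(i) = (1/(K-1)) * sum_{j != i} P[i, j]."""
    K = P.shape[0]
    # Sum each row, then subtract the diagonal entry P[i, i] (an item
    # never duels itself), and normalize by the K-1 opponents.
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

# Hypothetical preference matrix: P[i, j] = Pr[item i beats item j],
# with P[i, j] + P[j, i] = 1 and P[i, i] = 1/2 by convention.
P = np.array([
    [0.5, 0.9, 0.8],
    [0.1, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])
scores = borda_scores(P)          # array([0.85, 0.35, 0.3])
winner = int(np.argmax(scores))   # Borda-winner: item 0
```

In the adversarial setting of the paper, a different preference matrix $P_t$ may be chosen at each round, and regret is measured against the best fixed item under the cumulative Borda scores.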
