仔细观察带有图形反馈的匪徒的小损失边界

论文标题

仔细观察带有图形反馈的匪徒的小损失边界

A Closer Look at Small-loss Bounds for Bandits with Graph Feedback

论文作者

Lee, Chung-Wei, Luo, Haipeng, Zhang, Mengxiao

论文摘要

我们研究带有图形反馈的对抗性多臂匪徒的小损失界，也就是说，自适应遗憾界限取决于最佳臂或相关数量的损失，而不是回合总数。我们得出了第一个用于一般可观察到的图形的小损坏，解决了Lykouris等人的开放问题。（2018）。具体来说，我们开发了一种遗憾的算法$ \ nathcal {\ tilde {o}}}（\ sqrt {\ sqrt {κL_*}）$，其中$κ$是集团分区编号，$ l _*$是损失的最佳武器，并且对于每个人的遗憾，我们都会对自我的遗憾进行自我遗憾，我们对自我的遗憾有所改善，我们是自我遗憾的。 $ \ Mathcal {\ tilde {o}}（\ min \ {\ sqrt {αt}，\ sqrt {κL_*} \}）$，其中$α\ leqκ$是独立编号。我们的结果大大改善了Lykouris等人的结果。（2018）只考虑自我意识的图形。此外，我们还首次尝试得出弱观察图的小损失边界。我们首先证明在这种情况下没有典型的小损失边界，然后就某些特定的武器子集的丢失而提出具有替代性小损失界限的算法。令人惊讶的一面结果是，$ \ Mathcal {\ tilde {o}}（\ sqrt {t}）$遗憾即使对于弱可观察到的图形也可以实现，只要最好的手臂具有自动摇动。我们的算法基于在线镜下降框架，但需要一套可能具有独立感兴趣的新技术。此外，我们所有的算法都可以在不了解环境的情况下无参数。

We study small-loss bounds for adversarial multi-armed bandits with graph feedback, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities, instead of the total number of rounds. We derive the first small-loss bound for general strongly observable graphs, resolving an open problem of Lykouris et al. (2018). Specifically, we develop an algorithm with regret $\mathcal{\tilde{O}}(\sqrt{κL_*})$ where $κ$ is the clique partition number and $L_*$ is the loss of the best arm, and for the special case of self-aware graphs where every arm has a self-loop, we improve the regret to $\mathcal{\tilde{O}}(\min\{\sqrt{αT}, \sqrt{κL_*}\})$ where $α\leq κ$ is the independence number. Our results significantly improve and extend those by Lykouris et al. (2018) who only consider self-aware undirected graphs. Furthermore, we also take the first attempt at deriving small-loss bounds for weakly observable graphs. We first prove that no typical small-loss bounds are achievable in this case, and then propose algorithms with alternative small-loss bounds in terms of the loss of some specific subset of arms. A surprising side result is that $\mathcal{\tilde{O}}(\sqrt{T})$ regret is achievable even for weakly observable graphs as long as the best arm has a self-loop. Our algorithms are based on the Online Mirror Descent framework but require a suite of novel techniques that might be of independent interest. Moreover, all our algorithms can be made parameter-free without the knowledge of the environment.

下载PDF全文

下载文献需遵守相关版权规定

论文标题