Paper Title

Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Paper Authors

Guojun Xiong, Jian Li

Paper Abstract

Multi-player multi-armed bandit is an increasingly relevant decision-making problem, motivated by applications to cognitive radio systems. Most research on this problem focuses exclusively on settings in which players have \textit{full access} to all arms and receive no reward when pulling the same arm. Hence all players solve the same bandit problem with the goal of maximizing their cumulative reward. However, these settings neglect several important factors of many real-world applications, where players have \textit{limited access} to \textit{a dynamic local subset of arms} (i.e., an arm could sometimes be ``walking'' and not accessible to the player). To this end, this paper proposes a \textit{multi-player multi-armed walking bandits} model, aiming to address the aforementioned modeling issues. The goal is still to maximize the cumulative reward; however, players can only pull arms from their local subsets, and a player collects a full reward only if no other player pulls the same arm. We adopt the Upper Confidence Bound (UCB) technique to handle the exploration-exploitation tradeoff and employ distributed optimization techniques to properly handle collisions. By carefully integrating these two techniques, we propose a decentralized algorithm with a near-optimal regret guarantee that can be easily implemented to obtain competitive empirical performance.
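To make the setting concrete, below is a minimal toy sketch (not the paper's algorithm) of decentralized UCB play with "walking" arms: each player sees only a random local subset of arms per round, computes UCB indices over that subset, and receives a full reward only if no other player pulled the same arm. All names and parameters (`K`, `M`, `T`, the 0.7 availability probability, Bernoulli rewards) are illustrative assumptions; the paper's distributed-optimization collision handling is deliberately omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: K arms with Bernoulli rewards, M players, T rounds.
K, M, T = 6, 3, 5000
true_means = rng.uniform(0.2, 0.9, size=K)

counts = np.zeros((M, K))  # how often each player pulled each arm
sums = np.zeros((M, K))    # cumulative reward each player observed per arm

for t in range(1, T + 1):
    # "Walking" arms: each player only has access to a random local subset this round.
    available = [np.flatnonzero(rng.random(K) < 0.7) for _ in range(M)]

    # Each player independently picks the highest-UCB arm among its local subset.
    choices = np.full(M, -1)
    for m in range(M):
        arms = available[m]
        if arms.size == 0:
            continue  # no arm is reachable this round
        n = counts[m, arms]
        mean = np.where(n > 0, sums[m, arms] / np.maximum(n, 1), 0.0)
        # Unpulled arms get an infinite index, forcing initial exploration.
        ucb = np.where(n > 0, mean + np.sqrt(2 * np.log(t) / np.maximum(n, 1)), np.inf)
        choices[m] = arms[np.argmax(ucb)]

    # Collisions: a full reward is collected only if no other player pulled the same arm.
    for m in range(M):
        a = choices[m]
        if a < 0:
            continue
        collided = np.sum(choices == a) > 1
        draw = float(rng.random() < true_means[a])  # Bernoulli reward draw
        reward = 0.0 if collided else draw
        counts[m, a] += 1
        sums[m, a] += reward
```

Without any coordination, players in this sketch tend to chase the same high-mean arms and collide repeatedly; this is precisely the failure mode the paper's distributed-optimization component is designed to resolve.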
