Paper Title


Collaborative Multi-agent Stochastic Linear Bandits

Paper Authors

Ahmadreza Moradipari, Mohammad Ghavamzadeh, Mahnoosh Alizadeh

Abstract


We study a collaborative multi-agent stochastic linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret. In this setting, each agent has its own linear bandit problem (its own reward parameter) and the goal is to select the best global action w.r.t. the average of their reward parameters. At each round, each agent proposes an action, and one action is randomly selected and played as the network action. All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents. We propose a distributed upper confidence bound (UCB) algorithm and prove a high probability bound on its $T$-round regret in which we include a linear growth of regret associated with each communication round. Our regret bound is of order $\mathcal{O}\Big(\sqrt{\frac{T}{N \log(1/|\lambda_2|)}}\cdot (\log T)^2\Big)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix.
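The consensus step described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's algorithm: the paper uses an *accelerated* (Chebyshev-polynomial) consensus procedure, while the sketch below runs plain consensus iterations $x_{k+1} = W x_k$ with a doubly stochastic communication matrix $W$. It still shows the key mechanism: each agent's local reward estimate contracts toward the network average at a rate governed by $|\lambda_2|$, the second-largest eigenvalue of $W$ in absolute value.

```python
import numpy as np

def consensus_average(rewards, W, num_iters):
    """Plain (non-accelerated) consensus: repeatedly mix local values
    with the communication matrix W. With W doubly stochastic, every
    agent's value converges to the network average at rate |lambda_2|^k.
    """
    x = np.asarray(rewards, dtype=float)
    for _ in range(num_iters):
        x = W @ x  # each agent averages over its neighbors
    return x

# Hypothetical example: N = 4 agents on a ring with uniform mixing weights.
# This W is symmetric and doubly stochastic; its eigenvalues are
# {1, 0.5, 0.5, 0}, so |lambda_2| = 0.5 and the error shrinks by half
# per communication round.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

rewards = np.array([1.0, 2.0, 3.0, 4.0])  # each agent's observed reward
estimates = consensus_average(rewards, W, num_iters=50)
# After 50 rounds, every agent's estimate is essentially the true
# average (1+2+3+4)/4 = 2.5.
```

A smaller $|\lambda_2|$ (a better-connected network) means fewer communication rounds are needed for a given accuracy, which is exactly why the regret bound above improves as $\log(1/|\lambda_2|)$ grows.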
