远程上下文匪徒

论文标题

远程上下文匪徒

Remote Contextual Bandits

论文作者

Pase, Francesco, Gunduz, Deniz, Zorzi, Michele

论文摘要

我们考虑了一个远程上下文的多军强盗（CMAB）问题，其中决策者观察上下文和奖励，但必须传达代理商通过利率限制的通信渠道采取的行动。例如，这可以建模一个个性化的广告放置应用程序，内容所有者在其中观察到其网站的各个访问者，因此具有上下文信息，但必须将必须向每个访问者显示的广告传达给管理营销内容的独立实体。在这个远程CMAB（R-CMAB）问题中，对决策者和代理商之间的通信率的限制在每个代理发送的位数与获得的平均奖励之间实施了权衡。我们特别有兴趣表征实现次线性遗憾所需的利率。因此，这可以被视为一个政策压缩问题，其中失真度量是由学习目标引起的。我们首先通过让代理商数量进入无限，并研究采用汤普森采样策略时所取得的遗憾，从而研究了该问题的基本信息理论限制。特别是，我们分别确定了两个不同的速率区域，分别导致线性和次线性遗憾行为。然后，当决策者可以可靠地传输政策而不会失真时，我们就会为可实现的遗憾提供上限。

We consider a remote contextual multi-armed bandit (CMAB) problem, in which the decision-maker observes the context and the reward, but must communicate the actions to be taken by the agents over a rate-limited communication channel. This can model, for example, a personalized ad placement application, where the content owner observes the individual visitors to its website, and hence has the context information, but must convey the ads that must be shown to each visitor to a separate entity that manages the marketing content. In this remote CMAB (R-CMAB) problem, the constraint on the communication rate between the decision-maker and the agents imposes a trade-off between the number of bits sent per agent and the acquired average reward. We are particularly interested in characterizing the rate required to achieve sub-linear regret. Consequently, this can be considered as a policy compression problem, where the distortion metric is induced by the learning objectives. We first study the fundamental information theoretic limits of this problem by letting the number of agents go to infinity, and study the regret achieved when Thompson sampling strategy is adopted. In particular, we identify two distinct rate regions resulting in linear and sub-linear regret behavior, respectively. Then, we provide upper bounds on the achievable regret when the decision-maker can reliably transmit the policy without distortion.

下载PDF全文

下载文献需遵守相关版权规定

论文标题