Paper title
One-bit feedback is sufficient for upper confidence bound policies
Paper authors
Paper abstract
We consider a variant of the traditional multi-armed bandit problem in which each arm is only able to provide one-bit feedback during each pull based on its past history of rewards. Our main result is the following: given an upper confidence bound policy which uses full-reward feedback, there exists a coding scheme for generating one-bit feedback, and a corresponding decoding scheme and arm selection policy, such that the ratio of the regret achieved by our policy and the regret of the full-reward feedback policy asymptotically approaches one.
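To make the setting concrete, here is a minimal Python sketch of a one-bit feedback loop around a UCB-style index. It is not the coding scheme from the paper, which the abstract does not spell out. It assumes rewards in [0, 1], a delta-modulation-style encoder (the bit says whether the arm's empirical mean exceeds the learner's current estimate), a decoder step size of 1/(n+1), and the standard UCB1 bonus. All class and function names (`OneBitArm`, `OneBitDecoder`, `run_one_bit_ucb`) are hypothetical, and the step size is untuned, with no regret guarantee claimed.

```python
import math
import random


class OneBitArm:
    """Arm side: keeps its own reward history and emits one bit per pull."""

    def __init__(self, sample_reward):
        self.sample_reward = sample_reward  # callable returning a reward in [0, 1]
        self.total = 0.0
        self.pulls = 0

    def pull(self, learner_estimate):
        # Assumed encoder: report whether the arm's empirical mean is above
        # the learner's estimate. The estimate is passed in for simplicity;
        # an arm could track it itself because the decoder is deterministic.
        self.total += self.sample_reward()
        self.pulls += 1
        return 1 if self.total / self.pulls > learner_estimate else 0


class OneBitDecoder:
    """Learner side: rebuilds a mean estimate from the bit stream alone."""

    def __init__(self):
        self.estimate = 0.5  # midpoint of the assumed reward range [0, 1]
        self.pulls = 0

    def update(self, bit):
        self.pulls += 1
        step = 1.0 / (self.pulls + 1)  # assumed shrinking step size
        self.estimate += step if bit else -step


def run_one_bit_ucb(samplers, horizon):
    """Run a UCB1-style policy that only ever sees one bit per pull."""
    arms = [OneBitArm(s) for s in samplers]
    decoders = [OneBitDecoder() for _ in samplers]
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1  # pull every arm once before using the index
        else:
            i = max(
                range(len(arms)),
                key=lambda k: decoders[k].estimate
                + math.sqrt(2.0 * math.log(t) / decoders[k].pulls),
            )
        bit = arms[i].pull(decoders[i].estimate)
        decoders[i].update(bit)
    return [d.pulls for d in decoders], [d.estimate for d in decoders]


if __name__ == "__main__":
    random.seed(0)
    samplers = [
        lambda: random.uniform(0.0, 0.4),  # worse arm, mean 0.2
        lambda: random.uniform(0.6, 1.0),  # better arm, mean 0.8
    ]
    pulls, estimates = run_one_bit_ucb(samplers, 2000)
    print("pulls per arm:", pulls)
    print("decoded means:", estimates)
```

With constant rewards, this decoder's estimate stays within 1/(n+1) of the true mean after n pulls of that arm. For noisy rewards the sketch offers no such bound, which is the gap a carefully designed coding scheme would need to close.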