Paper Title

Asymptotic Randomised Control with applications to bandits

Paper Authors

Cohen, Samuel N., Treetanthiploet, Tanut

Paper Abstract

We consider a general multi-armed bandit problem with correlated (and simple contextual and restless) elements, as a relaxed control problem. By introducing an entropy regularisation, we obtain a smooth asymptotic approximation to the value function. This yields a novel semi-index approximation of the optimal decision process. This semi-index can be interpreted as explicitly balancing an exploration-exploitation trade-off, as in the optimistic (UCB) principle, where the learning premium explicitly describes the asymmetry of information available in the environment and the non-linearity in the reward function. Performance of the resulting Asymptotic Randomised Control (ARC) algorithm compares favourably with other approaches to correlated multi-armed bandits.
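
The abstract describes the semi-index only at a high level: an exploitation estimate plus an explicit learning premium, in the spirit of UCB. As a rough, hypothetical illustration of that kind of index rule (not the paper's ARC construction), the sketch below runs a classical UCB1-style rule on independent Gaussian arms; the arm means, horizon, and bonus form are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical sketch of an index rule of the form
#     index(arm) = estimated reward + learning premium,
# the UCB-style structure the abstract alludes to. This uses the classical
# UCB1 bonus on independent Gaussian arms, NOT the paper's ARC semi-index.

rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.5, 0.7])   # assumed arm means (illustration only)
n_arms = len(true_means)
counts = np.zeros(n_arms)                # number of pulls per arm
sums = np.zeros(n_arms)                  # cumulative observed reward per arm

T = 1000                                 # assumed horizon
for t in range(1, T + 1):
    if t <= n_arms:
        arm = t - 1                      # pull each arm once to initialise
    else:
        means = sums / counts                          # exploitation term
        premium = np.sqrt(2.0 * np.log(t) / counts)    # learning premium (exploration bonus)
        arm = int(np.argmax(means + premium))
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    sums[arm] += reward

print("pull counts per arm:", counts)    # the best arm should dominate
```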
