论文标题
有限参数化的多军匪徒的有限遗憾
Bounded Regret for Finitely Parameterized Multi-Armed Bandits
论文作者
论文摘要
我们考虑有限参数化的多臂匪徒的问题,其中可以根据常见的未知参数来表征潜在的随机环境的模型。学习代理未知的真实参数。但是,有限的可能参数集已知先验。我们提出了一种简单易于实现的算法,我们称之为有限的参数化上限限制(FP-UCB)算法,该算法使用有关基础参数集的信息,以用于更快的学习。特别是,我们表明FP-UCB算法在基础参数集的某些结构条件下实现了有限的遗憾。我们还表明,如果基础参数集不满足必要的结构条件,则FP-UCB算法会实现对数遗憾,但与标准UCB算法相比,前常数较小。我们还通过广泛的数值模拟来验证FP-UCB算法的出色性能。
We consider the problem of finitely parameterized multi-armed bandits where the model of the underlying stochastic environment can be characterized based on a common unknown parameter. The true parameter is unknown to the learning agent. However, the set of possible parameters, which is finite, is known a priori. We propose an algorithm that is simple and easy to implement, which we call Finitely Parameterized Upper Confidence Bound (FP-UCB) algorithm, which uses the information about the underlying parameter set for faster learning. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy the necessary structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller preceding constant compared to the standard UCB algorithm. We also validate the superior performance of the FP-UCB algorithm through extensive numerical simulations.