有限参数化的多军匪徒的有限遗憾

论文标题

有限参数化的多军匪徒的有限遗憾

Bounded Regret for Finitely Parameterized Multi-Armed Bandits

论文作者

Panaganti, Kishan, Kalathil, Dileep

论文摘要

我们考虑有限参数化的多臂匪徒的问题，其中可以根据常见的未知参数来表征潜在的随机环境的模型。学习代理未知的真实参数。但是，有限的可能参数集已知先验。我们提出了一种简单易于实现的算法，我们称之为有限的参数化上限限制（FP-UCB）算法，该算法使用有关基础参数集的信息，以用于更快的学习。特别是，我们表明FP-UCB算法在基础参数集的某些结构条件下实现了有限的遗憾。我们还表明，如果基础参数集不满足必要的结构条件，则FP-UCB算法会实现对数遗憾，但与标准UCB算法相比，前常数较小。我们还通过广泛的数值模拟来验证FP-UCB算法的出色性能。

We consider the problem of finitely parameterized multi-armed bandits where the model of the underlying stochastic environment can be characterized based on a common unknown parameter. The true parameter is unknown to the learning agent. However, the set of possible parameters, which is finite, is known a priori. We propose an algorithm that is simple and easy to implement, which we call Finitely Parameterized Upper Confidence Bound (FP-UCB) algorithm, which uses the information about the underlying parameter set for faster learning. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy the necessary structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller preceding constant compared to the standard UCB algorithm. We also validate the superior performance of the FP-UCB algorithm through extensive numerical simulations.

下载PDF全文

下载文献需遵守相关版权规定

论文标题