Paper title
Rate-adaptive model selection over a collection of black-box contextual bandit algorithms
Paper authors
Paper abstract
We consider the model selection task in the stochastic contextual bandit setting. Suppose we are given a collection of base contextual bandit algorithms. We provide a master algorithm that combines them and achieves the same performance, up to constants, as the best base algorithm would have achieved had it been run on its own. Our approach only requires that each algorithm satisfy a high-probability regret bound. Our procedure is very simple and essentially does the following: for a well-chosen sequence of probabilities $(p_{t})_{t\geq 1}$, at each round $t$, it either chooses at random which candidate to follow (with probability $p_{t}$) or compares, at the same internal sample size for each candidate, the cumulative reward of each, and selects the one that wins the comparison (with probability $1-p_{t}$). To the best of our knowledge, our proposal is the first one to be rate-adaptive for a collection of general black-box contextual bandit algorithms: it achieves the same regret rate as the best candidate. We demonstrate the effectiveness of our method with simulation studies.
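The selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout (one list of per-round rewards per candidate) and the tie-breaking are assumptions, and the paper's actual $p_t$ schedule and internal-sample-size bookkeeping may differ.

```python
import random


def master_choice(candidate_rewards, p_t, rng=random):
    """Pick which base algorithm to follow in the current round.

    candidate_rewards: one list per candidate, holding the rewards that
        candidate earned on the rounds it was followed (its internal
        sample is exactly those rounds).
    p_t: exploration probability for this round.
    Returns the index of the candidate to follow.
    """
    if rng.random() < p_t:
        # With probability p_t: follow a uniformly random candidate.
        return rng.randrange(len(candidate_rewards))
    # With probability 1 - p_t: compare all candidates at the SAME
    # internal sample size n (the smallest sample size among them),
    # and follow the one with the largest cumulative reward up to n.
    n = min(len(r) for r in candidate_rewards)
    return max(range(len(candidate_rewards)),
               key=lambda i: sum(candidate_rewards[i][:n]))
```

For example, with candidates whose reward histories are `[1.0, 1.0, 0.0]` and `[0.0, 1.0]`, the comparison is made at internal sample size $n = 2$, so the first candidate (cumulative reward 2.0 vs. 1.0) wins whenever the exploitation branch is taken.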