Paper Title

A One-Size-Fits-All Solution to Conservative Bandit Problems

Paper Authors

Yihan Du, Siwei Wang, Longbo Huang

Paper Abstract

In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB), and conservative contextual combinatorial bandits (CCCB). Unlike previous works, which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees ($T$-independent additive regrets instead of $T$-dependent ones) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with $O(1/T)$ normalized additive regret ($T$-independent in the cumulative form) and validate this result through empirical evaluation.
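To make the sample-path constraint concrete, the following is a minimal sketch of one common anytime formalization used in the conservative-bandit literature; the notation ($a_s$ for the action chosen at round $s$, $X_{a_s,s}$ for its realized reward, $X_{0,s}$ for the realized baseline reward, and $\alpha \in (0,1)$ for the conservativeness level) is an illustrative assumption, not necessarily the paper's exact definition:

$$\sum_{s=1}^{t} X_{a_s,s} \;\ge\; (1-\alpha)\sum_{s=1}^{t} X_{0,s}, \qquad \text{for all } t = 1, \dots, T.$$

By contrast, the high-probability constraints in earlier work impose the analogous inequality on expected rewards (e.g., $\mu_{a_s}$ and $\mu_0$) and only require it to hold with probability at least $1-\delta$. For the mean-variance extension, a standard criterion in mean-variance bandits trades off reward and variability as $\mathrm{MV}_a = \mu_a - \rho\,\sigma_a^2$ with a risk-tolerance parameter $\rho$; whether MV-CBP adopts exactly this form is an assumption here.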
