Paper Title

Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget

Paper Authors

Brandt, Jasmin, Bengs, Viktor, Haddenhorst, Björn, Hüllermeier, Eyke

Paper Abstract

We consider the combinatorial bandits problem with semi-bandit feedback under finite sampling budget constraints, in which the learner can carry out its action only for a limited number of times specified by an overall budget. The action is to choose a set of arms, whereupon feedback for each arm in the chosen set is received. Unlike existing works, we study this problem in a non-stochastic setting with subset-dependent feedback, i.e., the semi-bandit feedback received could be generated by an oblivious adversary and also might depend on the chosen set of arms. In addition, we consider a general feedback scenario covering both the numerical-based as well as preference-based case and introduce a sound theoretical framework for this setting guaranteeing sensible notions of optimal arms, which a learner seeks to find. We suggest a generic algorithm suitable to cover the full spectrum of conceivable arm elimination strategies from aggressive to conservative. Theoretical questions about the sufficient and necessary budget of the algorithm to find the best arm are answered and complemented by deriving lower bounds for any learning algorithm for this problem scenario.
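The abstract describes the algorithm only at a high level, so the following is a minimal illustrative sketch of a budgeted arm-elimination loop with semi-bandit feedback on chosen subsets, not the authors' actual method. All names here (`feedback_oracle`, `subset_size`, `elimination_fraction`) are hypothetical; `elimination_fraction` merely stands in for the aggressive-to-conservative elimination spectrum mentioned above, and a noisy numerical feedback setting is assumed for concreteness even though the paper also covers preference-based and subset-dependent feedback.

```python
import random


def budgeted_arm_elimination(num_arms, budget, subset_size,
                             elimination_fraction, feedback_oracle):
    """Illustrative sketch of a generic budgeted arm-elimination loop.

    feedback_oracle(subset) -> dict mapping each arm in `subset` to an
    observed numerical feedback value (semi-bandit feedback, possibly
    subset-dependent).  All parameter names are hypothetical.
    """
    active = list(range(num_arms))           # arms still considered candidates
    totals = {a: 0.0 for a in active}        # accumulated feedback per arm
    counts = {a: 0 for a in active}          # number of observations per arm
    used = 0                                 # budget spent so far

    while used < budget and len(active) > 1:
        # One action: query a subset of active arms, get feedback for each arm in it.
        subset = random.sample(active, min(subset_size, len(active)))
        feedback = feedback_oracle(subset)
        for arm in subset:
            totals[arm] += feedback[arm]
            counts[arm] += 1
        used += 1

        # Eliminate a fraction of the empirically worst active arms.
        # elimination_fraction near 1 ~ aggressive, near 0 ~ conservative.
        scored = sorted(active, key=lambda a: totals[a] / max(counts[a], 1))
        num_drop = min(int(elimination_fraction * len(active)), len(active) - 1)
        active = scored[num_drop:]

    # Return the empirically best remaining arm.
    return max(active, key=lambda a: totals[a] / max(counts[a], 1))


if __name__ == "__main__":
    # Toy usage: arm 4 has the highest mean feedback; the oracle adds Gaussian noise.
    means = [0.2, 0.4, 0.5, 0.6, 0.9]

    def noisy_oracle(subset):
        return {a: means[a] + random.gauss(0, 0.1) for a in subset}

    best = budgeted_arm_elimination(num_arms=5, budget=200, subset_size=3,
                                    elimination_fraction=0.25,
                                    feedback_oracle=noisy_oracle)
    print("identified arm:", best)
```

The single `elimination_fraction` knob is only one way to interpolate between aggressive and conservative elimination; the paper's generic algorithm is parameterized more generally and comes with sufficient and necessary budget guarantees that this sketch does not attempt to reproduce.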
