Paper Title
Smooth Bandit Optimization: Generalization to Hölder Space
Paper Authors
Paper Abstract
We consider bandit optimization of a smooth reward function, where the goal is cumulative regret minimization. This problem has been studied for $\alpha$-Hölder continuous (including Lipschitz) functions with $0<\alpha\leq 1$. Our main result generalizes the reward function to Hölder spaces with exponent $\alpha>1$, bridging the gap between Lipschitz bandits and infinitely-differentiable models such as linear bandits. For Hölder continuous functions, approaches based on random sampling within bins of a discretized domain suffice to be optimal. In contrast, we propose a class of two-layer algorithms that deploy misspecified linear/polynomial bandit algorithms within the bins. We demonstrate that the proposed algorithms can exploit higher-order smoothness of the function by deriving a regret upper bound of $\tilde{O}(T^{\frac{d+\alpha}{d+2\alpha}})$ for $\alpha>1$, which matches the existing lower bound. We also study adaptation to unknown function smoothness over a continuous scale of Hölder spaces indexed by $\alpha$, applying a bandit model selection approach to our proposed two-layer algorithms. We show that it achieves a regret rate that matches the existing lower bound for adaptation within the $\alpha\leq 1$ subset.
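For intuition, the following is a minimal, hypothetical sketch of the two-layer idea in $d=1$: an outer layer runs UCB1 over the bins of a uniform discretization of $[0,1]$, while an inner layer fits a local polynomial of degree $\lfloor\alpha\rfloor$ in the selected bin to exploit higher-order smoothness. The function name `two_layer_bandit`, the UCB1 outer layer, the inner exploration rate, and all constants are illustrative assumptions, not the paper's algorithm or its tuned parameters.

```python
# Hypothetical two-layer bandit sketch on [0, 1]; not the paper's algorithm.
# Outer layer: UCB1 treating each bin as an arm.
# Inner layer: local polynomial fit of degree floor(alpha) within the bin.
import numpy as np

def two_layer_bandit(f, T, alpha=2.0, n_bins=8, noise_sd=0.1, seed=0):
    rng = np.random.default_rng(seed)
    degree = int(np.floor(alpha))          # local polynomial degree
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts = np.zeros(n_bins)              # outer-layer pull counts per bin
    means = np.zeros(n_bins)               # outer-layer empirical means
    xs = [[] for _ in range(n_bins)]       # inner-layer observations per bin
    ys = [[] for _ in range(n_bins)]
    rewards = []
    for t in range(1, T + 1):
        # Outer layer: pick the bin with the highest UCB1 index.
        ucb = means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1.0))
        ucb[counts == 0] = np.inf          # pull each bin at least once
        b = int(np.argmax(ucb))
        lo, hi = edges[b], edges[b + 1]
        # Inner layer: play the maximizer of a local polynomial fit,
        # with occasional uniform exploration inside the bin.
        if len(xs[b]) > degree + 1 and rng.random() > 0.2:
            coef = np.polyfit(np.array(xs[b]), np.array(ys[b]), degree)
            grid = np.linspace(lo, hi, 50)
            x = grid[np.argmax(np.polyval(coef, grid))]
        else:
            x = rng.uniform(lo, hi)
        y = f(x) + noise_sd * rng.standard_normal()
        xs[b].append(x); ys[b].append(y); rewards.append(y)
        counts[b] += 1
        means[b] += (y - means[b]) / counts[b]
    return np.array(rewards)

if __name__ == "__main__":
    f = lambda x: 1.0 - (x - 0.3) ** 2     # smooth reward, maximized at x = 0.3
    r = two_layer_bandit(f, T=2000)
    print("mean reward over last 200 rounds:", r[-200:].mean())
```

The sketch only illustrates the division of labor between the layers; the paper's regret guarantee additionally depends on choosing the bin width as a function of $T$, $d$, and $\alpha$, which is omitted here.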