Paper Title

Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference

Paper Authors

Debangshu Banerjee, Avishek Ghosh, Sayak Ray Chowdhury, Aditya Gopalan

Paper Abstract

We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon and the action space has a constant Hessian around the optimal arm. This shows that such action spaces force a polynomial lower bound, rather than the logarithmic lower bound shown by \cite{lattimore2017end} for discrete (i.e., well-separated) action spaces. Furthermore, while the previous result holds only in the asymptotic regime (as $n \to \infty$), our result for these "locally rich" action spaces is anytime. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigenvalue that holds with high probability. We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi-agent framework in which we show, by leveraging the spectral result, that no forced exploration is necessary -- the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur low regret.
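To give the spectral claim a concrete numerical reading, below is a minimal simulation sketch, not the paper's algorithm: it runs a standard LinUCB-style rule on the unit sphere (a smooth action set with constant Hessian around the optimal arm) and tracks the minimum eigenvalue of the design matrix, which the result above predicts should grow on the order of $\sqrt{n}$. The confidence width, candidate-search maximization, and all constants are heuristic choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 20000
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)           # unknown parameter on the unit sphere

# Unit-sphere action set: smooth, with constant Hessian curvature around
# the optimal arm a* = theta, matching the "locally rich" setting above.
lam, noise_sd = 1.0, 0.1
V = lam * np.eye(d)                      # (regularized) design matrix V_n
b = np.zeros(d)
checkpoints, min_eigs = [], []

for t in range(1, n + 1):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    beta = 0.5 * np.sqrt(np.log(t + 1))  # heuristic confidence width (assumption)

    # LinUCB over the sphere: maximize <theta_hat, a> + beta * ||a||_{V^{-1}}
    # via a crude random candidate search (the exact maximizer would need
    # an eigen-solver; this is enough for an illustration).
    cands = rng.standard_normal((256, d))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', cands, V_inv, cands))
    a = cands[np.argmax(cands @ theta_hat + beta * bonus)]

    r = a @ theta + noise_sd * rng.standard_normal()
    V += np.outer(a, a)
    b += r * a

    if t % 1000 == 0:
        checkpoints.append(t)
        min_eigs.append(np.linalg.eigvalsh(V).min())

# If the Omega(sqrt(n)) lower bound is in force, the last column should
# stabilize (stay bounded away from zero) as t grows.
for t, e in zip(checkpoints, min_eigs):
    print(f"t={t:6d}  lambda_min={e:9.2f}  lambda_min/sqrt(t)={e / np.sqrt(t):6.3f}")
```

Under these assumptions the printed ratio $\lambda_{\min}(V_t)/\sqrt{t}$ should settle near a positive constant, in line with the $\Omega(\sqrt{n})$ growth the abstract describes for no-regret algorithms on curved action sets.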
