Paper Title
Lifelong Bandit Optimization: No Prior and No Regret
Paper Authors
Paper Abstract
Machine learning algorithms are often applied repeatedly to problems with similar structure. We focus on solving a sequence of bandit optimization tasks and develop LIBO, an algorithm that adapts to the environment by learning from past experience and becomes more sample-efficient in the process. We assume a kernelized structure in which the kernel is unknown but shared across all tasks. LIBO sequentially meta-learns a kernel that approximates the true kernel and solves each incoming task with the latest kernel estimate. Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees oracle-optimal performance, meaning that as more tasks are solved, the regret of LIBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a sublinear bandit algorithm, LIBO yields sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret, and we propose F-LIBO, which solves the lifelong problem in a federated manner.
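To make the lifelong loop concrete, the following is a minimal Python sketch of the idea described in the abstract: each incoming bandit task is solved with a GP-UCB-style base algorithm under the current kernel estimate, and the estimate is refined after each task. All names and the placeholder lengthscale update are illustrative assumptions, not the paper's actual LIBO implementation or meta-learning procedure.

# Minimal sketch (assumptions, not the paper's code): a sequence of bandit
# tasks shares an unknown kernel; each task is solved with GP-UCB under the
# current kernel estimate, which is refined after every task.
import numpy as np

def rbf_kernel(X, Y, lengthscale):
    # RBF kernel; stands in for the meta-learned kernel estimate.
    d = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-0.5 * d / lengthscale**2)

def gp_ucb_step(X_obs, y_obs, candidates, kernel, beta=2.0, noise=0.1):
    # One GP-UCB acquisition step: pick the candidate maximizing
    # posterior mean + beta * posterior std under the current kernel.
    K = kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_inv = np.linalg.inv(K)
    k_star = kernel(candidates, X_obs)
    mu = k_star @ K_inv @ y_obs
    var = np.clip(1.0 - np.sum((k_star @ K_inv) * k_star, axis=1), 1e-12, None)
    return int(np.argmax(mu + beta * np.sqrt(var)))

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(50, 2))
lengthscale = 1.0  # crude stand-in for the meta-learned kernel parameters

for task in range(5):  # sequence of bandit tasks with shared structure
    w = rng.normal(size=2)  # hypothetical task-specific reward parameters
    reward = lambda x: float(np.sin(x @ w))

    i0 = rng.integers(len(candidates))
    X_obs = candidates[[i0]]
    y_obs = np.array([reward(candidates[i0])])
    for t in range(20):  # solve the task with the latest kernel estimate
        k = lambda A, B: rbf_kernel(A, B, lengthscale)
        i = gp_ucb_step(X_obs, y_obs, candidates, k)
        X_obs = np.vstack([X_obs, candidates[i]])
        y_obs = np.append(y_obs, reward(candidates[i]))

    # Placeholder "meta-learning" update of the kernel estimate from the
    # task's data; LIBO's actual estimator is not reproduced here.
    pairwise = np.abs(X_obs[:, None, :] - X_obs[None, :, :])
    lengthscale = 0.5 * lengthscale + 0.5 * float(np.median(pairwise[pairwise > 0]))

In this sketch the base bandit algorithm (GP-UCB) never changes; only the kernel it is given improves from task to task, which mirrors the abstract's claim that LIBO can be paired with any kernelized or linear bandit algorithm.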