动态线性匪徒

论文标题

动态线性匪徒

Dynamical Linear Bandits

论文作者

Mussi, Marco, Metelli, Alberto Maria, Restelli, Marcello

论文摘要

在许多真实世界的顺序决策问题中，动作并未立即反映反馈，并在长期范围内传播其效果。例如，在在线广告中，投资平台会产生瞬时意识的提高，但是实际的奖励（即转换）可能会在将来发生很远。此外，转换是否取决于：意识增长的速度，消失的效果以及对其他广告平台的协同作用或干扰。先前的工作已经调查了多军匪徒框架，其可能是延迟和汇总反馈的可能性，而没有特定的结构，即动作如何在将来传播，而无视可能的动态效应。在本文中，我们介绍了一种新颖的设置，即动态线性土匪（DLB），这是以隐藏状态为特征的线性匪徒的扩展。当执行动作时，学习者会观察到嘈杂的奖励，其平均值是隐藏状态和动作的线性函数。然后，隐藏状态根据线性动力学发展，也受到执行动作的影响。我们首先介绍设置，讨论最佳政策的概念，并获得预期的遗憾下限。然后，我们提供一种乐观的遗憾最小化算法，动态线性的上限限制（DYNLIN-UCB），这使人们对订单的遗憾$ \ widetilde {\ Mathcal {\ Mathcal {o}} \ big（\ frac {d \ sqrt {d \ sqrt {t}}}} $ \overlineρ$是对系统稳定性的衡量标准，而$ d $是操作向量的维度。最后，我们对合成环境和现实数据进行了数值验证，以显示Dynlin-UCB与几个基线相比的有效性。

In many real-world sequential decision-making problems, an action does not immediately reflect on the feedback and spreads its effects over a long time frame. For instance, in online advertising, investing in a platform produces an instantaneous increase of awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on: how fast the awareness grows, its vanishing effects, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with the possibility of delayed and aggregated feedback, without a particular structure on how an action propagates in the future, disregarding possible dynamical effects. In this paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an extension of the linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to linear dynamics, affected by the performed action too. We start by introducing the setting, discussing the notion of optimal policy, and deriving an expected regret lower bound. Then, we provide an optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers an expected regret of order $\widetilde{\mathcal{O}} \Big( \frac{d \sqrt{T}}{(1-\overlineρ)^{3/2}} \Big)$, where $\overlineρ$ is a measure of the stability of the system, and $d$ is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines.

下载PDF全文

下载文献需遵守相关版权规定

论文标题