Paper Title


Reward Imputation with Sketching for Contextual Batched Bandits

Paper Authors

Xiao Zhang, Ninglu Shao, Zihua Si, Jun Xu, Wenhan Wang, Hanjing Su, Ji-Rong Wen

Paper Abstract

Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode, but the rewards of the non-executed actions are unobserved, resulting in partial-information feedback. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. In this paper, we propose an efficient approach called Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching, which approximates full-information feedback. We formulate reward imputation as an imputation-regularized ridge regression problem that captures the feedback mechanisms of both executed and non-executed actions. To reduce time complexity, we solve the regression problem using randomized sketching. We prove that our approach achieves an instantaneous regret with controllable bias and smaller variance than approaches without reward imputation. Furthermore, our approach enjoys a sublinear regret bound against the optimal policy. We also present two extensions, a rate-scheduled version and a version for nonlinear rewards, making our approach more practical. Experimental results show that SPUIR outperforms state-of-the-art baselines on synthetic, public benchmark, and real-world datasets.
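To make the core idea concrete, below is a minimal, hedged sketch (not the paper's actual SPUIR algorithm) of the two ingredients the abstract describes: imputing unobserved rewards with ridge regression, and accelerating the solve with a random sketch of the batch. All names, shapes, and the Gaussian sketch choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sketched_ridge_imputation(X, y, sketch_size, lam=1.0, rng=None):
    """Fit ridge-regression weights on a randomly sketched batch.

    X: (n, d) context features of executed actions in the batch.
    y: (n,) observed rewards for those actions.
    Returns weights theta; rewards of non-executed actions can then be
    imputed as X_nonexec @ theta (approximate full-information feedback).
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Gaussian sketch S of shape (m, n): S @ X and S @ y preserve the
    # least-squares geometry approximately while shrinking the row count,
    # which is where the time-complexity reduction comes from.
    m = min(sketch_size, n)
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    Xs, ys = S @ X, S @ y
    # Solve the sketched, l2-regularized normal equations.
    theta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)
    return theta

# Hypothetical usage on synthetic data: impute rewards for actions that
# were not executed in the batch.
rng = np.random.default_rng(0)
X_exec = rng.standard_normal((512, 16))
true_theta = rng.standard_normal(16)
r_exec = X_exec @ true_theta + 0.1 * rng.standard_normal(512)
theta_hat = sketched_ridge_imputation(X_exec, r_exec, sketch_size=128)
X_nonexec = rng.standard_normal((64, 16))
r_imputed = X_nonexec @ theta_hat  # imputed rewards for non-executed actions
```

In this toy setup the sketch size trades accuracy for speed: a larger `sketch_size` tightens the approximation of the full regression (lower bias and variance of the imputed rewards) at higher cost, which mirrors the bias-variance control discussed in the abstract.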
