Paper Title


Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

Paper Authors

Masahiro Kato, Yusuke Kaneko

Paper Abstract


The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy. However, because the contextual bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples. In the data-generating process, we do not assume the convergence of the policy, but the policy uses the same conditional probability of choosing an action during a certain period. Then, we derive an asymptotically normal estimator of the value of an evaluation policy. As another advantage of our method, the batch-based approach simultaneously solves the deficient support problem. Using benchmark and real-world datasets, we experimentally confirm the effectiveness of the proposed method.
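To make the setting described in the abstract concrete, the sketch below shows a plain importance-weighting OPE estimate on toy data where the behavior policy is held fixed within each batch and only changes between batches. This is only an illustration of the batch-update data-generating process, not the paper's estimator (which is built from a martingale difference sequence); the policy, reward model, and names such as `evaluation_policy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: 2 actions, scalar context. The behavior policy is updated
# only between batches, so its action probabilities are fixed within a batch.
n_batches, batch_size, n_actions = 5, 200, 2

def evaluation_policy(context):
    # Hypothetical evaluation policy: probability of each action given context.
    p1 = 1.0 / (1.0 + np.exp(-context))  # prefers action 1 for large contexts
    return np.stack([1.0 - p1, p1], axis=-1)

value_estimates = []
for b in range(n_batches):
    # Behavior policy for this batch (constant within the batch).
    behavior_probs = rng.dirichlet(np.ones(n_actions))
    contexts = rng.normal(size=batch_size)
    actions = rng.choice(n_actions, size=batch_size, p=behavior_probs)
    rewards = (actions == (contexts > 0)).astype(float)  # toy reward model

    # Importance-weighted rewards: pi_e(a|x) / pi_b(a|x) * r.
    pi_e = evaluation_policy(contexts)[np.arange(batch_size), actions]
    pi_b = behavior_probs[actions]
    value_estimates.append(np.mean(pi_e / pi_b * rewards))

# Crude overall estimate of the evaluation policy's value: average over batches.
print("batch-wise IPW estimate:", np.mean(value_estimates))
```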
