Paper Title

Anytime-valid off-policy inference for contextual bandits

Paper Authors

Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, Paul Mineiro

Paper Abstract

Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions form a highly dependent time series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust confidence sequences for time-varying off-policy mean reward values, as well as confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded (and if they are, we do not need to know these bounds), and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
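As context for the abstract above, the following is a minimal sketch of the standard doubly robust construction for off-policy evaluation, in illustrative notation (not necessarily the exact estimator of the paper): let $\pi$ be the target policy being evaluated, $h_t$ the logging policy at time $t$, and $\hat{r}_t$ a reward-regression estimate fit only on data collected before time $t$. Each observation $(X_t, A_t, R_t)$ is mapped to the pseudo-outcome

$$ \hat{\phi}_t \;=\; \sum_{a} \pi(a \mid X_t)\, \hat{r}_t(X_t, a) \;+\; \frac{\pi(A_t \mid X_t)}{h_t(A_t \mid X_t)} \bigl( R_t - \hat{r}_t(X_t, A_t) \bigr). $$

When the logging propensities $h_t(A_t \mid X_t)$ are known, $\mathbb{E}[\hat{\phi}_t \mid X_t] = \sum_a \pi(a \mid X_t)\, \mathbb{E}[R_t \mid X_t, A_t = a]$ regardless of how accurate $\hat{r}_t$ is; a better $\hat{r}_t$ only reduces variance. The confidence sequences described in the abstract are built around running averages of such pseudo-outcomes so that coverage holds simultaneously over time, i.e., at arbitrary stopping times.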
