Paper Title
CoinDICE: Off-Policy Confidence Interval Estimation
Paper Authors
Paper Abstract
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the $Q$-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
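To make the empirical-likelihood idea concrete, the sketch below is a minimal, self-contained illustration of the classical empirical-likelihood confidence interval for a scalar mean, i.e. the simplest estimating equation $E[X - \mu] = 0$. It is not the CoinDICE algorithm itself: CoinDICE applies the same profile-likelihood reweighting to the generalized estimating equations obtained from the embedded linear-program formulation of the $Q$-function. The function names and the synthetic data are illustrative assumptions.

```python
# Minimal sketch (NOT the paper's CoinDICE algorithm): classical empirical-likelihood
# confidence interval for a mean, illustrating the sample-reweighting principle that
# CoinDICE generalizes to off-policy value estimation.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2


def el_log_ratio(x, mu):
    """Profile empirical log-likelihood ratio log R(mu) for the mean.

    The optimal sample weights have the dual form w_i ∝ 1 / (1 + lam * (x_i - mu)),
    where the Lagrange multiplier lam solves sum_i w_i * (x_i - mu) = 0.
    """
    d = x - mu
    if d.min() >= 0.0 or d.max() <= 0.0:
        return -np.inf  # mu lies outside the convex hull of the data

    def score(lam):
        # Weighted estimating equation; monotone decreasing in lam.
        return np.sum(d / (1.0 + lam * d))

    # Valid multipliers keep every weight positive: 1 + lam * d_i > 0.
    lo = -1.0 / d.max() + 1e-10
    hi = -1.0 / d.min() - 1e-10
    lam = brentq(score, lo, hi)
    return -np.sum(np.log1p(lam * d))  # log R(mu) = -sum_i log(n * w_i)


def el_confidence_interval(x, alpha=0.05):
    """Level-(1 - alpha) empirical-likelihood interval: {mu : -2 log R(mu) <= chi2 quantile}."""
    threshold = -0.5 * chi2.ppf(1.0 - alpha, df=1)
    xbar = x.mean()

    def boundary(mu):
        return el_log_ratio(x, mu) - threshold

    lower = brentq(boundary, x.min() + 1e-8, xbar)
    upper = brentq(boundary, xbar, x.max() - 1e-8)
    return lower, upper


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(loc=1.0, scale=2.0, size=500)  # stand-in for observed returns
    print(el_confidence_interval(rewards))  # prints a (lower, upper) interval around the sample mean
```

The design choice mirrored from the abstract is that the interval comes from reweighting the observed samples within a divergence-constrained neighborhood of the empirical distribution, rather than from an asymptotic-normal plug-in estimate; CoinDICE performs this reweighting jointly with the saddle-point (Lagrangian) optimization over $Q$-like and density-ratio-like variables.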