Batch Policy Learning in Average Reward Markov Decision Processes

Paper Title

Batch Policy Learning in Average Reward Markov Decision Processes

Authors

Liao, Peng; Qi, Zhengling; Wan, Runzhe; Klasnja, Predrag; Murphy, Susan

Abstract

We consider the batch (offline) policy learning problem in the infinite-horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
