Paper Title
Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity
Paper Authors
Paper Abstract
In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms is comparable to, and may indeed be preferred over, existing least-squares TD methods for off-policy learning, due to its linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
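To make the saddle-point derivation mentioned above concrete, here is a minimal sketch assuming the standard GTD notation A = E[rho_t phi_t (phi_t - gamma phi_{t+1})^T], b = E[rho_t r_t phi_t], and M = E[phi_t phi_t^T]; these symbols and definitions are our assumptions for illustration, not text quoted from the abstract. Up to a factor of 1/2, the mean squared projected Bellman error admits a convex-concave reformulation via its Fenchel conjugate,

\[
\min_{\theta}\ \tfrac{1}{2}\,\lVert b - A\theta \rVert_{M^{-1}}^{2}
\;=\;
\min_{\theta}\,\max_{y}\ \Big\{ \langle b - A\theta,\ y \rangle \;-\; \tfrac{1}{2}\,\lVert y \rVert_{M}^{2} \Big\},
\]

and sampling unbiased gradients of the right-hand side, descending in theta and ascending in y, yields true stochastic gradient TD updates. The Python sketch below illustrates one such sampled update; the variable names, step sizes alpha and beta, and importance weight rho are illustrative assumptions rather than the paper's implementation.

import numpy as np

def saddle_point_gtd2_step(theta, y, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One sampled primal-dual (GTD2-style) update on the saddle-point objective.

    theta    : primal weights (linear value-function parameters)
    y        : dual weights (running estimate of M^{-1} (b - A theta))
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio for off-policy correction
    """
    # TD error under the current primal weights
    delta = reward + gamma * phi_next.dot(theta) - phi.dot(theta)
    # Dual ascent: sampled gradient of <b - A theta, y> - 0.5 * ||y||_M^2 with respect to y
    y = y + beta * (rho * delta - phi.dot(y)) * phi
    # Primal descent: sampled gradient of the same objective with respect to theta
    theta = theta + alpha * rho * phi.dot(y) * (phi - gamma * phi_next)
    return theta, y

GTD2-MP, as described in the abstract, replaces this plain stochastic gradient step with a proximal mirror-map step on the same saddle-point objective to obtain the improved convergence rate.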