Paper Title
Finite-Time Error Bounds for Greedy-GQ
Paper Authors
Paper Abstract
Greedy-GQ with linear function approximation, originally proposed in \cite{maei2010toward}, is a value-based off-policy algorithm for optimal control in reinforcement learning, and it has a non-linear two-timescale structure with a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as $\mathcal{O}({1}/{\sqrt{T}})$ under the i.i.d.\ setting and $\mathcal{O}({\log T}/{\sqrt{T}})$ under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using a nested-loop approach, and show that its sample complexity is $\mathcal{O}(\log(1/\epsilon)\epsilon^{-2})$, which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of stochastic gradient descent algorithms for general smooth non-convex optimization problems, despite the additional challenge posed by the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for the finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.
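For context, the following is a minimal sketch of one two-timescale Greedy-GQ update with linear function approximation, as described in \cite{maei2010toward}. The function and variable names (e.g. `greedy_gq_step`, `phi_next_all`) and the specific step sizes are illustrative assumptions, not taken from the paper; the slow update changes the main parameter $\theta$ and the fast update tracks the auxiliary parameter $\omega$.

```python
import numpy as np

def greedy_gq_step(theta, omega, phi_t, r_t, phi_next_all, gamma, alpha, beta):
    """One two-timescale Greedy-GQ update with linear features (illustrative sketch).

    theta        : main parameter vector (slow timescale)
    omega        : auxiliary parameter vector (fast timescale)
    phi_t        : feature vector phi(s_t, a_t) of the observed state-action pair
    r_t          : observed reward
    phi_next_all : matrix whose rows are phi(s_{t+1}, a) for every action a
    gamma        : discount factor
    alpha, beta  : slow / fast step sizes
    """
    # Greedy action at the next state under the current theta.
    a_star = np.argmax(phi_next_all @ theta)
    phi_next = phi_next_all[a_star]

    # TD error with respect to the greedy target policy.
    delta = r_t + gamma * phi_next @ theta - phi_t @ theta

    # Slow (main) update: TD term with a gradient-correction term using omega.
    theta_new = theta + alpha * (delta * phi_t - gamma * (omega @ phi_t) * phi_next)
    # Fast (auxiliary) update: tracks the projection of the TD error onto the features.
    omega_new = omega + beta * (delta - omega @ phi_t) * phi_t
    return theta_new, omega_new
```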