两个时间尺度（天然）参与者 - 批评算法的非反应收敛分析

论文标题

两个时间尺度（天然）参与者 - 批评算法的非反应收敛分析

Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms

论文作者

Xu, Tengyu, Wang, Zhe, Liang, Yingbin

论文摘要

作为增强学习算法的一种重要类型，参与者批评（AC）和自然参与者（NAC）算法通常以两种方式执行以找到最佳策略。在第一个嵌套环设计中，Actor的一个策略更新之后是评论家的价值功能更新的整个循环，最近确定了此类AC和NAC算法的有限样本分析。第二个两个时间尺度的设计，其中演员和评论家同时更新，但学习率不同，其调音参数比嵌套环设计少得多，因此实现了很容易实现。尽管已经显示出两个时间级的AC和NAC在文献中收敛，但尚未确定有限样本的收敛速率。在本文中，我们在马尔可夫采样下提供了第一个这种非质子收敛速率，并提供了具有一般政策类别近似的演员。我们表明，两个时间尺度的AC需要按$ \ MATHCAL {O}（ε^{ - 2.5} \ log^3（ε^{ - 1}）$达到的总体样本复杂性，才能达到$ε$ - $ -ACCCPURATE SENTARAY POINT，并且需要两个时间尺度的NAC样品复杂性。 $ \ MATHCAL {o}（ε^{ - 4} \ log^2（ε^{ - 1}））$以达到$ε$ - $ -CACCURATE的全局最佳点。我们开发了通过动态改变马尔可夫抽样以及通过动态变化的基本功能和过渡内核来分析线性评论家的收敛速率，开发了用于界定演员偏差误差的新颖技术。

As an important type of reinforcement learning algorithms, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first nested-loop design, actor's one update of policy is followed by an entire loop of critic's updates of the value function, and the finite-sample analysis of such AC and NAC algorithms have been recently well established. The second two time-scale design, in which actor and critic update simultaneously but with different learning rates, has much fewer tuning parameters than the nested-loop design and is hence substantially easier to implement. Although two time-scale AC and NAC have been shown to converge in the literature, the finite-sample convergence rate has not been established. In this paper, we provide the first such non-asymptotic convergence rate for two time-scale AC and NAC under Markovian sampling and with actor having general policy class approximation. We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(ε^{-2.5}\log^3(ε^{-1}))$ to attain an $ε$-accurate stationary point, and two time-scale NAC requires the overall sample complexity at the order of $\mathcal{O}(ε^{-4}\log^2(ε^{-1}))$ to attain an $ε$-accurate global optimal point. We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel.

下载PDF全文

下载文献需遵守相关版权规定

论文标题