Paper Title
Geometry and convergence of natural policy gradient methods
Paper Authors
Paper Abstract
We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and by Morimura and co-authors, by observing that these arise from the Hessian geometries of conditional entropy and entropy, respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions, such as log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates for these methods when the step size equals the penalization strength.
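To make the abstract's central construction concrete, the following is a minimal LaTeX sketch of a Hessian-geometry gradient flow. The symbols $\mu$ (state-action distribution), $\phi$ (strictly convex potential), and $R$ (reward functional) are illustrative notation, not necessarily the paper's, and the affine constraints of the state-action polytope are suppressed for readability.

% Hessian metric induced by a strictly convex potential \phi
% (illustrative notation; polytope constraints suppressed):
\[
  g_\phi(\mu)(v, w) \;=\; v^\top \nabla^2 \phi(\mu)\, w .
\]
% NPG trajectories in state-action space as the gradient flow of the
% reward functional R with respect to g_\phi:
\[
  \dot{\mu}_t \;=\; \nabla^2 \phi(\mu_t)^{-1}\, \nabla R(\mu_t).
\]
% Choices of \phi named in the abstract: negative conditional entropy
% (Kakade's metric) and negative entropy (the metric of Morimura and
% co-authors) yield linear convergence; log-barriers yield sublinear rates.

In the same notation, the abstract's Newton-method interpretation concerns the regularized objective $R(\mu) - \lambda\,\phi(\mu)$: taking the NPG with respect to the Hessian geometry of the regularizer $\phi$ and a step size equal to the penalization strength $\lambda$ makes the discrete-time update an inexact Newton step, which is the source of the local quadratic rate.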