Paper Title
On the convergence of policy gradient methods to Nash equilibria in general stochastic games
Paper Authors
Paper Abstract
Learning in stochastic games is a notoriously difficult problem because, in addition to each other's strategic decisions, the players must also contend with the fact that the game itself evolves over time, possibly in a very complicated manner. Because of this, the convergence properties of popular learning algorithms - like policy gradient and its variants - are poorly understood, except in specific classes of games (such as potential or two-player, zero-sum games). In view of this, we examine the long-run behavior of policy gradient methods with respect to Nash equilibrium policies that are second-order stationary (SOS) in a sense similar to the type of sufficiency conditions used in optimization. Our first result is that SOS policies are locally attracting with high probability, and we show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ distance-squared convergence rate if the method's step-size is chosen appropriately. Subsequently, specializing to the class of deterministic Nash policies, we show that this rate can be improved dramatically and, in fact, policy gradient methods converge within a finite number of iterations in that case.
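To make the algorithmic setup concrete, the following is a minimal, illustrative sketch of REINFORCE-style policy gradient ascent run independently by two players. It uses a toy stateless 2x2 game rather than the stochastic-game setting analyzed in the paper, and the payoff matrices, the softmax policy parametrization, and the 1/sqrt(n) step-size schedule are assumptions made purely for illustration, not the paper's construction.

```python
# Illustrative sketch (not the paper's method): two players independently run
# REINFORCE-style policy gradient with softmax policies on a toy 2x2 game.
# Payoffs, parametrization, and step-size schedule are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical payoff matrices for a 2-action game.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])   # player 1's payoffs
B = np.array([[3.0, 5.0],
              [0.0, 1.0]])   # player 2's payoffs

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

theta1 = np.zeros(2)         # logits of player 1's policy
theta2 = np.zeros(2)         # logits of player 2's policy

for n in range(1, 10001):
    pi1, pi2 = softmax(theta1), softmax(theta2)
    a1 = rng.choice(2, p=pi1)
    a2 = rng.choice(2, p=pi2)
    r1, r2 = A[a1, a2], B[a1, a2]

    # Score-function (REINFORCE) gradient estimate: reward * grad log pi(a),
    # which for a softmax policy is reward * (one_hot(a) - pi).
    grad1 = r1 * (np.eye(2)[a1] - pi1)
    grad2 = r2 * (np.eye(2)[a2] - pi2)

    # Decreasing step-size; 1/sqrt(n) is an illustrative schedule only.
    step = 1.0 / np.sqrt(n)
    theta1 += step * grad1
    theta2 += step * grad2

print("player 1 policy:", softmax(theta1))
print("player 2 policy:", softmax(theta2))
```

Each player's update here is a score-function estimate of its own policy gradient; the paper's results concern when such stochastic trajectories, with appropriately chosen step-sizes, converge to second-order stationary Nash policies in general stochastic games.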