论文标题

零和马尔可夫游戏中策略优化的最后近期收敛速度更快

Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

论文作者

Cen, Shicong, Chi, Yuejie, Du, Simon S., Xiao, Lin

论文摘要

多代理增强学习(MARL)(多个代理都学会在共享的动态环境中进行交互)遍及广泛的关键应用程序。尽管在理解单代理RL中政策优化方法的全球融合方面取得了重大进展,但在MARL设置中设计和分析了有效的政策优化算法带来了重大挑战,不幸的是,现有理论仍然高度不足。在本文中,我们专注于竞争性多代理RL的最基本环境,即零和马尔可夫游戏,并在Infinite-Horizo​​n折扣设置和有限的Horizo​​n-Horizo​​n情节环境中研究平衡查找算法。我们提出了一种具有对称更新的单环策略优化方法,其中策略是通过熵调查的乐观乘法权重更新(OMWU)方法更新的,并且该值在较慢的时间表上更新。我们表明,在全信息表格设置中,所提出的方法实现了正则化问题的定量响应平衡的有限时间的最后近期线性收敛,这可以通过控制正规化的量来转化为均方根二介质的最后一际融合到NASH平衡。我们的融合结果改善了最著名的迭代复杂性,并可以更好地了解竞争性马尔可夫游戏中的政策优化。

Multi-Agent Reinforcement Learning (MARL) -- where multiple agents learn to interact in a shared dynamic environment -- permeates across a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, designing and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges, which unfortunately, remain highly inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities, and lead to a better understanding of policy optimization in competitive Markov games.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源