Paper Title

Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Authors

Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen

Abstract

We study Concave Constrained Markov Decision Processes (Concave CMDPs), where both the objective and the constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and the constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we show that VR-PDPG achieves an $\widetilde{O}(\varepsilon^{-4})$ sample complexity for $\varepsilon$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.
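The abstract describes a primal-dual scheme: policy-gradient ascent on the primal (policy) variable and projected sub-gradient descent on the dual variable. Below is a minimal illustrative sketch of such a generic primal-dual policy-gradient loop on a toy tabular CMDP. It is not the paper's VR-PDPG: it uses the standard linear (expected reward/cost) special case rather than general concave functions of the occupancy measure, omits variance reduction and the pessimistic constraint term, and all environment parameters, step sizes, and the budget value are hypothetical. The dual projection here is simply onto the nonnegative orthant.

```python
import numpy as np

# Toy CMDP: 3 states, 2 actions, random dynamics, discounted episodic setting (all hypothetical).
rng = np.random.default_rng(0)
S, A, gamma, horizon = 3, 2, 0.9, 50
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                 # reward (objective) r(s, a)
c = rng.uniform(size=(S, A))                 # cost used in the constraint c(s, a)
budget = 2.0                                 # constraint: expected discounted cost <= budget

theta = np.zeros((S, A))                     # softmax policy parameters (primal variable)
lam = 0.0                                    # Lagrange multiplier (dual variable)
eta_theta, eta_lam = 0.5, 0.05               # step sizes (hypothetical)

def policy(theta):
    """Tabular softmax policy pi(a | s)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def rollout(pi):
    """Sample one trajectory; return discounted reward/cost returns and a crude
    REINFORCE-style score-function term (no baseline, kept simple for illustration)."""
    s, disc = 0, 1.0
    G_r = G_c = 0.0
    score = np.zeros((S, A))
    for _ in range(horizon):
        a = rng.choice(A, p=pi[s])
        G_r += disc * r[s, a]
        G_c += disc * c[s, a]
        grad_log = -pi[s].copy()             # d log pi(a|s) / d theta[s, .] = e_a - pi(.|s)
        grad_log[a] += 1.0
        score[s] += grad_log
        s = rng.choice(S, p=P[s, a])
        disc *= gamma
    return G_r, G_c, score

for t in range(2000):
    pi = policy(theta)
    G_r, G_c, score = rollout(pi)
    # Primal ascent on the Lagrangian L = reward - lam * (cost - budget).
    theta += eta_theta * (G_r - lam * G_c) * score
    # Dual projected (sub)gradient descent: step toward the constraint violation,
    # then project onto the nonnegative orthant.
    lam = max(0.0, lam + eta_lam * (G_c - budget))

print(f"final multiplier lam = {lam:.3f}")
```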
