Paper Title

Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Authors

Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen

Abstract

We study Concave Constrained Markov Decision Processes (Concave CMDPs), where both the objective and the constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and the constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we show that VR-PDPG achieves an $\widetilde{O}(\varepsilon^{-4})$ sample complexity for $\varepsilon$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.
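The abstract describes a primal-dual scheme: policy-gradient ascent on the primal (policy) variable and projected sub-gradient descent on the dual variable. Below is a minimal illustrative sketch of such a generic primal-dual policy-gradient loop on a toy tabular CMDP. It is not the paper's VR-PDPG: it uses the standard linear (expected reward/cost) special case rather than general concave functions of the occupancy measure, omits variance reduction and the pessimistic constraint term, and all environment parameters, step sizes, and the budget value are hypothetical. The dual projection here is simply onto the nonnegative orthant.

```python
import numpy as np

# Toy CMDP: 3 states, 2 actions, random dynamics, discounted episodic setting (all hypothetical).
rng = np.random.default_rng(0)
S, A, gamma, horizon = 3, 2, 0.9, 50
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                 # reward (objective) r(s, a)
c = rng.uniform(size=(S, A))                 # cost used in the constraint c(s, a)
budget = 2.0                                 # constraint: expected discounted cost <= budget

theta = np.zeros((S, A))                     # softmax policy parameters (primal variable)
lam = 0.0                                    # Lagrange multiplier (dual variable)
eta_theta, eta_lam = 0.5, 0.05               # step sizes (hypothetical)

def policy(theta):
    """Tabular softmax policy pi(a | s)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def rollout(pi):
    """Sample one trajectory; return discounted reward/cost returns and a crude
    REINFORCE-style score-function term (no baseline, kept simple for illustration)."""
    s, disc = 0, 1.0
    G_r = G_c = 0.0
    score = np.zeros((S, A))
    for _ in range(horizon):
        a = rng.choice(A, p=pi[s])
        G_r += disc * r[s, a]
        G_c += disc * c[s, a]
        grad_log = -pi[s].copy()             # d log pi(a|s) / d theta[s, .] = e_a - pi(.|s)
        grad_log[a] += 1.0
        score[s] += grad_log
        s = rng.choice(S, p=P[s, a])
        disc *= gamma
    return G_r, G_c, score

for t in range(2000):
    pi = policy(theta)
    G_r, G_c, score = rollout(pi)
    # Primal ascent on the Lagrangian L = reward - lam * (cost - budget).
    theta += eta_theta * (G_r - lam * G_c) * score
    # Dual projected (sub)gradient descent: step toward the constraint violation,
    # then project onto the nonnegative orthant.
    lam = max(0.0, lam + eta_lam * (G_c - budget))

print(f"final multiplier lam = {lam:.3f}")
```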
