论文标题
Markov决策过程的梯度感知搜索算法
A Gradient-Aware Search Algorithm for Constrained Markov Decision Processes
论文作者
论文摘要
有限限制的马尔可夫决策过程(CMDP)的规范解决方案方法,其目的是基于凸线线性计划,以最大化预期的无限 - 霍森折扣奖励。在此简介中,我们首先证明有限CMDP的双线性程序中的优化目标是针对Lagrange惩罚乘数的零件线性凸功能(PWLC)。接下来,我们提出了一种新型的两级梯度感知搜索(气)算法,该算法利用PWLC结构找到有限CMDP的最佳状态值函数和Lagrange惩罚乘数。所提出的算法应用于两个随机控制问题,具有约束:网格世界中的机器人导航和基于太阳能的无人机(UAV)基于基于太阳能的无线网络管理。我们从经验上比较了所提出的气体算法的收敛性能与二进制搜索(BS),Lagrangian Primal-Dual-Dual优化(PDO)和线性编程(LP)。与基准算法相比,结果表明,提出的气体算法会更快地收敛到最佳溶液,不需要高参数调整,并且对Lagrange惩罚乘数的初始化不敏感。
The canonical solution methodology for finite constrained Markov decision processes (CMDPs), where the objective is to maximize the expected infinite-horizon discounted rewards subject to the expected infinite-horizon discounted costs constraints, is based on convex linear programming. In this brief, we first prove that the optimization objective in the dual linear program of a finite CMDP is a piece-wise linear convex function (PWLC) with respect to the Lagrange penalty multipliers. Next, we propose a novel two-level Gradient-Aware Search (GAS) algorithm which exploits the PWLC structure to find the optimal state-value function and Lagrange penalty multipliers of a finite CMDP. The proposed algorithm is applied in two stochastic control problems with constraints: robot navigation in a grid world and solar-powered unmanned aerial vehicle (UAV)-based wireless network management. We empirically compare the convergence performance of the proposed GAS algorithm with binary search (BS), Lagrangian primal-dual optimization (PDO), and Linear Programming (LP). Compared with benchmark algorithms, it is shown that the proposed GAS algorithm converges to the optimal solution faster, does not require hyper-parameter tuning, and is not sensitive to initialization of the Lagrange penalty multiplier.