Markov决策过程的梯度感知搜索算法

论文标题

Markov决策过程的梯度感知搜索算法

A Gradient-Aware Search Algorithm for Constrained Markov Decision Processes

论文作者

Khairy, Sami, Balaprakash, Prasanna, Cai, Lin X.

论文摘要

有限限制的马尔可夫决策过程（CMDP）的规范解决方案方法，其目的是基于凸线线性计划，以最大化预期的无限 - 霍森折扣奖励。在此简介中，我们首先证明有限CMDP的双线性程序中的优化目标是针对Lagrange惩罚乘数的零件线性凸功能（PWLC）。接下来，我们提出了一种新型的两级梯度感知搜索（气）算法，该算法利用PWLC结构找到有限CMDP的最佳状态值函数和Lagrange惩罚乘数。所提出的算法应用于两个随机控制问题，具有约束：网格世界中的机器人导航和基于太阳能的无人机（UAV）基于基于太阳能的无线网络管理。我们从经验上比较了所提出的气体算法的收敛性能与二进制搜索（BS），Lagrangian Primal-Dual-Dual优化（PDO）和线性编程（LP）。与基准算法相比，结果表明，提出的气体算法会更快地收敛到最佳溶液，不需要高参数调整，并且对Lagrange惩罚乘数的初始化不敏感。

The canonical solution methodology for finite constrained Markov decision processes (CMDPs), where the objective is to maximize the expected infinite-horizon discounted rewards subject to the expected infinite-horizon discounted costs constraints, is based on convex linear programming. In this brief, we first prove that the optimization objective in the dual linear program of a finite CMDP is a piece-wise linear convex function (PWLC) with respect to the Lagrange penalty multipliers. Next, we propose a novel two-level Gradient-Aware Search (GAS) algorithm which exploits the PWLC structure to find the optimal state-value function and Lagrange penalty multipliers of a finite CMDP. The proposed algorithm is applied in two stochastic control problems with constraints: robot navigation in a grid world and solar-powered unmanned aerial vehicle (UAV)-based wireless network management. We empirically compare the convergence performance of the proposed GAS algorithm with binary search (BS), Lagrangian primal-dual optimization (PDO), and Linear Programming (LP). Compared with benchmark algorithms, it is shown that the proposed GAS algorithm converges to the optimal solution faster, does not require hyper-parameter tuning, and is not sensitive to initialization of the Lagrange penalty multiplier.

下载PDF全文

下载文献需遵守相关版权规定

论文标题