Paper Title

Multi-Objective Policy Gradients with Topological Constraints

Paper Authors

Kyle Hollins Wray, Stas Tiomkin, Mykel J. Kochenderfer, Pieter Abbeel

Paper Abstract

Multi-objective optimization models that encode ordered sequential constraints provide a solution to model various challenging problems including encoding preferences, modeling a curriculum, and enforcing measures of safety. A recently developed theory of topological Markov decision processes (TMDPs) captures this range of problems for the case of discrete states and actions. In this work, we extend TMDPs towards continuous spaces and unknown transition dynamics by formulating, proving, and implementing the policy gradient theorem for TMDPs. This theoretical result enables the creation of TMDP learning algorithms that use function approximators, and can generalize existing deep reinforcement learning (DRL) approaches. Specifically, we present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm. We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
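
For context, the abstract builds on two standard single-objective results: the classical policy gradient theorem, which the paper generalizes to the TMDP setting, and the PPO clipped surrogate objective, which the proposed algorithm extends. In standard notation these are:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

where d^{\pi_\theta} is the (discounted) state visitation distribution, Q^{\pi_\theta} is the action-value function, \hat{A}_t is an advantage estimate, and \epsilon is the clipping parameter. These are only the familiar single-objective forms; the TMDP-specific policy gradient theorem, which handles topologically ordered multiple objectives, is formulated and proved in the paper itself and is not reproduced here.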
