Paper Title

Interpreting Primal-Dual Algorithms for Constrained Multiagent Reinforcement Learning

Paper Authors

Tabas, Daniel, Zamzam, Ahmed S., Zhang, Baosen

Paper Abstract

Constrained multiagent reinforcement learning (C-MARL) is gaining importance as MARL algorithms find new applications in real-world systems ranging from energy systems to drone swarms. Most C-MARL algorithms use a primal-dual approach to enforce constraints through a penalty function added to the reward. In this paper, we study the structural effects of this penalty term on the MARL problem. First, we show that the standard practice of using the constraint function as the penalty leads to a weak notion of safety. However, by making simple modifications to the penalty term, we can enforce meaningful probabilistic (chance and conditional value at risk) constraints. Second, we quantify the effect of the penalty term on the value function, uncovering an improved value estimation procedure. We use these insights to propose a constrained multiagent advantage actor critic (C-MAA2C) algorithm. Simulations in a simple constrained multiagent environment affirm that our reinterpretation of the primal-dual method in terms of probabilistic constraints is effective, and that our proposed value estimate accelerates convergence to a safe joint policy.
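The primal-dual approach the abstract describes can be illustrated with a minimal sketch: the constraint cost is subtracted from the reward, weighted by a Lagrange multiplier that is updated by projected dual ascent. This is a generic illustration of the standard penalty formulation, not the paper's C-MAA2C algorithm; the function names, step sizes, and the use of a fixed batch of sampled costs in place of policy rollouts are all assumptions for the sake of the example.

```python
import numpy as np

def dual_ascent(costs, limit, lam=0.0, lr=0.1, iters=100):
    """Projected dual ascent on the Lagrange multiplier `lam`.

    `costs` is a hypothetical stand-in for per-rollout constraint costs
    estimated under the current joint policy; `limit` is the constraint
    threshold. `lam` grows while the average cost exceeds the limit and
    is projected back to zero otherwise.
    """
    for _ in range(iters):
        avg_cost = np.mean(costs)                      # rollout cost estimate
        lam = max(0.0, lam + lr * (avg_cost - limit))  # projected ascent step
    return lam

def penalized_reward(reward, cost, lam):
    """Standard primal-dual penalty: reward minus weighted constraint cost."""
    return reward - lam * cost
```

If sampled costs stay below the threshold, the multiplier remains at zero and the agents optimize the unpenalized reward; persistent violations drive the multiplier up until the penalty outweighs the gain from violating the constraint.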
