Paper Title

Cooperative Multi-Agent Reinforcement Learning with Partial Observations

Paper Authors

Yan Zhang, Michael M. Zavlanos

Paper Abstract

In this paper, we propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL). Existing MARL algorithms often assume that every agent can observe the states and actions of all the other agents in the network. This can be impractical in large-scale problems, where sharing the state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed zeroth-order policy optimization method is that it allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards that depend on partial state and action information only and can be obtained using consensus. Specifically, to calculate the local policy gradients, we develop a new distributed zeroth-order policy gradient estimator that relies on one-point residual-feedback which, compared to existing zeroth-order estimators that also rely on one-point feedback, significantly reduces the variance of the policy gradient estimates improving, in this way, the learning performance. We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to the neighborhood of a policy that is a stationary point of the global objective function. The size of this neighborhood depends on the agents' learning rates, the exploration parameters, and the number of consensus steps used to calculate the local estimates of the global accumulated rewards. Moreover, we provide numerical experiments that demonstrate that our new zeroth-order policy gradient estimator is more sample-efficient compared to other existing one-point estimators.
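The key mechanism described in the abstract is the one-point residual-feedback gradient estimator: at each iteration an agent queries the objective only once, at the currently perturbed policy parameters, and reuses the value observed at the previously perturbed parameters as a baseline. Below is a minimal sketch of this idea under standard assumptions, using the common form g_t = (d/δ)[f(x_t + δu_t) − f(x_{t−1} + δu_{t−1})]u_t with directions u_t drawn on the unit sphere; in the distributed MARL setting of the paper, f would be each agent's consensus-based local estimate of the global accumulated reward. The function and variable names here are illustrative, not the paper's implementation.

```python
import numpy as np

def residual_feedback_grad(f, x, delta, prev_value, rng):
    """One-point residual-feedback zeroth-order gradient estimate (sketch).

    Uses a single new query f(x + delta*u) per step and subtracts the previous
    perturbed value as a baseline, which reduces the variance relative to the
    plain one-point estimator (d / delta) * f(x + delta*u) * u.
    """
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)              # random direction on the unit sphere
    value = f(x + delta * u)            # the only function evaluation this step
    grad = (d / delta) * (value - prev_value) * u
    return grad, value

# Illustrative use on a toy smooth objective (gradient ascent with constant stepsize).
rng = np.random.default_rng(0)
f = lambda x: -np.sum(x ** 2)           # stand-in for a local estimate of the global return
x = rng.standard_normal(5)
delta, lr = 0.1, 1e-3                   # exploration parameter and constant stepsize
prev_value = f(x)                       # initialize the residual baseline
for _ in range(10_000):
    grad, prev_value = residual_feedback_grad(f, x, delta, prev_value, rng)
    x += lr * grad                      # ascent step toward a neighborhood of a stationary point
```

Because only one new function value is needed per update, each agent can run this estimator with whatever local reward estimate consensus provides, without observing the states and actions of all other agents.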
