Paper Title
A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning
Paper Authors
Paper Abstract
A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other simultaneously learning agents. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of other agents. Moreover, each agent is itself constantly learning, leading to natural non-stationarity in the distribution of experiences encountered. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to multiagent learning settings. This is achieved by modeling our gradient updates to consider both an agent's own non-stationary policy dynamics and the non-stationary policy dynamics of other agents in the environment. We show that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently comprises all key aspects of previous state-of-the-art approaches on this topic. We test our method on a diverse suite of multiagent benchmarks and demonstrate a more efficient ability to adapt to new agents as they learn than baseline methods across the full spectrum of mixed-incentive, competitive, and cooperative domains.
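The abstract's central idea, gradient updates that account both for an agent's own learning and for the learning of the other agents, can be sketched schematically. The following is a minimal illustration, not the paper's exact theorem; the notation ($\phi^i_\ell$ for agent $i$'s policy parameters after $\ell$ inner-loop updates, $U$ for the inner learning rule, $\tau_\ell$ for the trajectories collected at inner step $\ell$, and $V^i$ for agent $i$'s expected return) is assumed here for exposition.

\[
J^i(\phi^i_0) \;=\; \mathbb{E}\!\left[\sum_{\ell=0}^{L} V^i\!\left(\phi^i_\ell,\ \phi^{-i}_\ell\right)\right],
\qquad
\phi_{\ell+1} \;=\; U\!\left(\phi_\ell,\ \tau_\ell\right),
\]

where $\phi^{-i}_\ell$ collects the other agents' parameters. Because every agent updates from trajectories that agent $i$'s policy helped generate, both $\phi^i_\ell$ and $\phi^{-i}_\ell$ depend on $\phi^i_0$, and differentiating the meta-objective through this chain of future joint policies gives, schematically (omitting score-function terms from the trajectory distribution for brevity),

\[
\nabla_{\phi^i_0}\, J^i
\;=\;
\mathbb{E}\!\left[\sum_{\ell=0}^{L}
\underbrace{\left(\frac{\partial \phi^i_\ell}{\partial \phi^i_0}\right)^{\!\top}\!\nabla_{\phi^i_\ell} V^i}_{\text{own non-stationary policy dynamics}}
\;+\;
\underbrace{\left(\frac{\partial \phi^{-i}_\ell}{\partial \phi^i_0}\right)^{\!\top}\!\nabla_{\phi^{-i}_\ell} V^i}_{\text{peers' non-stationary policy dynamics}}
\right],
\]

with the $\ell = 0$ term reducing to the ordinary single-agent policy gradient. Under this sketch, the first summand captures how the agent's initial parameters shape its own future learning, and the second captures how they shape the updates of the other learning agents, the two sources of non-stationarity the abstract describes.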