Paper Title

Off-Beat Multi-Agent Reinforcement Learning

Authors

Wei Qiu, Weixun Wang, Rundong Wang, Bo An, Yujing Hu, Svetlana Obraztsova, Zinovi Rabinovich, Jianye Hao, Yingfeng Chen, Changjie Fan

Abstract

We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failure for multi-agent coordination with off-beat actions. In order to fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories by utilizing agents' individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by the off-beat actions via our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency.
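The off-beat setting can be made concrete with a small sketch. The snippet below is a hypothetical Python illustration (the environment OffBeatToyEnv, the duration table ACTION_DURATIONS, and the helper redistribute_rewards are our own assumptions, not the paper's code): each committed action only takes effect after its pre-set execution duration, so the team reward arrives several steps after the joint action that caused it, and a naive redistribution shifts that delayed reward back to the committing step. This only exposes the temporal credit assignment problem the abstract describes; LeGEM's actual scheme builds per-agent episodic memories and redistributes reward with its own method, which is not reproduced here.

```python
# Hypothetical sketch of the off-beat action setting (not the paper's code).
import random

# Assumed per-action execution durations, in environment steps.
ACTION_DURATIONS = {0: 1, 1: 3, 2: 5}   # e.g. 0 = "move", 1 = "scout", 2 = "build"


class OffBeatToyEnv:
    """Toy two-agent environment: an action committed at step t only takes
    effect at step t + duration, so the environment keeps changing while the
    action executes and the reward is delayed relative to the committing step."""

    def __init__(self, n_agents=2, horizon=20):
        self.n_agents = n_agents
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.pending = []                    # (finish_step, agent_id, action)
        return [0.0] * self.n_agents         # dummy observations

    def step(self, actions):
        # Commit each agent's action; its effect lands only after its duration.
        for agent_id, a in enumerate(actions):
            self.pending.append((self.t + ACTION_DURATIONS[a], agent_id, a))

        # Actions whose execution finishes exactly now.
        finished = [p for p in self.pending if p[0] == self.t]
        self.pending = [p for p in self.pending if p[0] > self.t]

        # Team reward only if both agents' slow "build" actions finish together,
        # i.e. coordination must account for execution durations.
        reward = 1.0 if sum(a == 2 for _, _, a in finished) == self.n_agents else 0.0

        self.t += 1
        done = self.t >= self.horizon
        obs = [float(self.t)] * self.n_agents
        return obs, reward, done


def redistribute_rewards(rewards, trigger_duration=ACTION_DURATIONS[2]):
    """Naive redistribution for this toy env only: a reward observed at step t
    was triggered by actions committed at step t - trigger_duration, so move it
    back there. The paper's LeGEM uses episodic memories and its own scheme
    rather than a fixed shift like this."""
    shifted = [0.0] * len(rewards)
    for t, r in enumerate(rewards):
        if r != 0.0:
            shifted[max(t - trigger_duration, 0)] += r
    return shifted


if __name__ == "__main__":
    env = OffBeatToyEnv()
    env.reset()
    rewards, done = [], False
    while not done:
        joint_action = [random.choice(list(ACTION_DURATIONS)) for _ in range(env.n_agents)]
        _, r, done = env.step(joint_action)
        rewards.append(r)
    print("raw rewards:        ", rewards)
    print("redistributed (toy):", redistribute_rewards(rewards))
```

Running the loop with random actions shows the team reward appearing only when both delayed "build" actions happen to finish on the same step, several steps after they were chosen, which is exactly the kind of delayed, coordination-dependent signal the abstract attributes to off-beat actions.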
