Paper Title

Reward Design in Cooperative Multi-agent Reinforcement Learning for Packet Routing

Authors

Hangyu Mao, Zhibo Gong, Zhen Xiao

Abstract

In cooperative multi-agent reinforcement learning (MARL), how to design a suitable reward signal to accelerate learning and stabilize convergence is a critical problem. The global reward signal assigns the same global reward to all agents without distinguishing their contributions, while the local reward signal provides a different local reward to each agent based solely on its individual behavior. Both reward-assignment approaches have shortcomings: the former may encourage lazy agents, while the latter may produce selfish agents. In this paper, we study the reward design problem in cooperative MARL based on packet routing environments. First, we show that the above two reward signals are prone to produce suboptimal policies. Then, inspired by several observations and considerations, we design mixed reward signals that can be used off-the-shelf to learn better policies. Finally, we turn the mixed reward signals into adaptive counterparts, which achieve the best results in our experiments. Other reward signals are also discussed. As reward design is a fundamental problem in RL, and especially in MARL, we hope that MARL researchers will rethink the rewards used in their systems.
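To make the distinction between the reward signals concrete, the sketch below contrasts global, local, and mixed assignment for a small team of routing agents. This is a minimal illustration, not the paper's formulation: the convex combination with weight `w`, and the use of the mean as the global component, are assumptions standing in for the mixed and adaptive signals the paper actually defines.

```python
import numpy as np

def global_reward(local_rewards):
    """Every agent receives the same team-level reward.

    No contribution is distinguished, which the abstract notes
    may encourage lazy agents.
    """
    return np.full_like(local_rewards, local_rewards.mean())

def local_reward(local_rewards):
    """Each agent receives only its own reward, which the abstract
    notes may produce selfish agents."""
    return local_rewards

def mixed_reward(local_rewards, w=0.5):
    """Hypothetical blend: w * global component + (1 - w) * local component.

    An adaptive variant would adjust w during training rather than
    fixing it in advance.
    """
    return w * local_rewards.mean() + (1.0 - w) * local_rewards

# Example: per-agent rewards for three routers in one packet-routing step.
r = np.array([1.0, 0.2, 0.6])
print(global_reward(r))        # [0.6 0.6 0.6]
print(local_reward(r))         # [1.  0.2 0.6]
print(mixed_reward(r, w=0.5))  # [0.8 0.4 0.6]
```

Under this toy blend, an agent's return still moves with its own behavior (unlike the pure global signal) while remaining coupled to team performance (unlike the pure local signal), which is the trade-off the mixed signals are designed to exploit.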
