Paper Title

Curiosity-Driven Multi-Agent Exploration with Mixed Objectives

Paper Authors

Roben Delos Reyes, Kyunghwan Son, Jinhwan Jung, Wan Ju Kang, Yung Yi

Paper Abstract

Intrinsic rewards have been increasingly used to mitigate the sparse reward problem in single-agent reinforcement learning. These intrinsic rewards encourage the agent to look for novel experiences, guiding the agent to explore the environment sufficiently despite the lack of extrinsic rewards. Curiosity-driven exploration is a simple yet efficient approach that quantifies this novelty as the prediction error of the agent's curiosity module, an internal neural network that is trained to predict the agent's next state given its current state and action. We show here, however, that naively using this curiosity-driven approach to guide exploration in sparse reward cooperative multi-agent environments does not consistently lead to improved results. Straightforward multi-agent extensions of curiosity-driven exploration take into consideration either individual or collective novelty only and thus, they do not provide a distinct but collaborative intrinsic reward signal that is essential for learning in cooperative multi-agent tasks. In this work, we propose a curiosity-driven multi-agent exploration method that has the mixed objective of motivating the agents to explore the environment in ways that are individually and collectively novel. First, we develop a two-headed curiosity module that is trained to predict the corresponding agent's next observation in the first head and the next joint observation in the second head. Second, we design the intrinsic reward formula to be the sum of the individual and joint prediction errors of this curiosity module. We empirically show that the combination of our curiosity module architecture and intrinsic reward formulation guides multi-agent exploration more efficiently than baseline approaches, thereby providing the best performance boost to MARL algorithms in cooperative navigation environments with sparse rewards.
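
The abstract describes the two-headed curiosity module and the mixed intrinsic reward only at a high level. Below is a minimal illustrative sketch of how such a module and reward formulation could look, assuming PyTorch, flat vector observations, and one-hot or continuous action vectors. All names here (TwoHeadedCuriosityModule, mixed_intrinsic_reward, hidden_dim, and so on) are hypothetical and are not taken from the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): a two-headed curiosity
# module that predicts the agent's next individual observation (head 1) and
# the next joint observation of all agents (head 2), plus the mixed
# intrinsic reward formed as the sum of the two prediction errors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadedCuriosityModule(nn.Module):
    def __init__(self, obs_dim, act_dim, joint_obs_dim, hidden_dim=128):
        super().__init__()
        # Shared encoder over the agent's current observation and action
        # (action assumed to be a one-hot or continuous vector).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.individual_head = nn.Linear(hidden_dim, obs_dim)    # head 1
        self.joint_head = nn.Linear(hidden_dim, joint_obs_dim)   # head 2

    def forward(self, obs, action):
        h = self.encoder(torch.cat([obs, action], dim=-1))
        return self.individual_head(h), self.joint_head(h)


def mixed_intrinsic_reward(module, obs, action, next_obs, next_joint_obs):
    """Mixed objective: individual prediction error + joint prediction error."""
    pred_obs, pred_joint = module(obs, action)
    individual_error = F.mse_loss(pred_obs, next_obs, reduction="none").mean(dim=-1)
    joint_error = F.mse_loss(pred_joint, next_joint_obs, reduction="none").mean(dim=-1)
    return individual_error + joint_error
```

In a typical curiosity-driven setup, the same prediction errors would also serve as the curiosity module's training loss, and each agent's reward during training would presumably be the sparse extrinsic reward plus a scaled version of this intrinsic term.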
