Paper Title
MCMARL: Parameterizing Value Function via Mixture of Categorical Distributions for Multi-Agent Reinforcement Learning
Paper Authors
Paper Abstract
In cooperative multi-agent tasks, a team of agents jointly interacts with an environment by taking actions, receiving a team reward, and observing the next state. During these interactions, the uncertainty of the environment and reward inevitably induces stochasticity in the long-term returns, and this randomness is exacerbated as the number of agents increases. However, most existing value-based multi-agent reinforcement learning (MARL) methods ignore such randomness and model only the expectation of the Q-value for both individual agents and the team. Compared to using the expectations of the long-term returns, it is preferable to model the stochasticity directly by estimating the returns through distributions. With this motivation, this work proposes a novel value-based MARL framework from a distributional perspective, \emph{i.e.}, parameterizing the value function via a \underline{M}ixture of \underline{C}ategorical distributions for MARL (MCMARL). Specifically, we model both the individual Q-values and the global Q-value with categorical distributions. To integrate categorical distributions, we define five basic operations on distributions, which allow expected value function factorization methods (\emph{e.g.}, VDN and QMIX) to be generalized to their MCMARL variants. We further prove that our MCMARL framework satisfies the \emph{Distributional-Individual-Global-Max} (DIGM) principle with respect to the expectation of the distribution, which guarantees consistency between the greedy joint action of the global Q-value and the greedy individual actions of the individual Q-values. Empirically, we evaluate MCMARL on both a stochastic matrix game and a challenging set of StarCraft II micromanagement tasks, showing the efficacy of our framework.
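The abstract does not spell out the five distributional operations or the mixing architecture, so the following is only a minimal sketch of the expectation-based view it describes: each agent's Q-value is represented as a categorical distribution over a fixed support (a C51-style parameterization assumed for illustration, with hypothetical atom range and network-free random logits), greedy actions are chosen with respect to the expectation of each agent's distribution, and the expectations are combined additively in a VDN-style fashion, under which joint and individual greedy actions coincide as the DIGM principle requires.

```python
import numpy as np

# Sketch only, not the authors' implementation: atom count, value range,
# and softmax parameterization are illustrative assumptions.
N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)  # shared support z_1 .. z_K

def categorical_q(logits):
    """Map per-action logits of shape (n_actions, N_ATOMS) to
    (atom probabilities, expected Q-value per action)."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)   # softmax over atoms
    q_exp = (probs * atoms).sum(axis=-1)         # E[Z(a)] for each action
    return probs, q_exp

# Two agents with 3 actions each; in practice the logits would come from each
# agent's utility network conditioned on its local observation history.
rng = np.random.default_rng(0)
_, q1 = categorical_q(rng.normal(size=(3, N_ATOMS)))
_, q2 = categorical_q(rng.normal(size=(3, N_ATOMS)))

# Individual greedy actions are taken w.r.t. the expectation of each agent's
# distribution; with a monotonic (here VDN-style additive) combination of
# expectations, the greedy joint action under the combined value coincides
# with these per-agent argmaxes -- the consistency DIGM asks for.
greedy_joint_action = (int(q1.argmax()), int(q2.argmax()))
q_tot_expectation = q1.max() + q2.max()
print(greedy_joint_action, q_tot_expectation)
```

The additive combination above stands in for whatever monotonic mixing the MCMARL variants of VDN and QMIX actually use; the point of the sketch is only that taking argmaxes of per-agent expectations is consistent with the argmax of the mixed expectation.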