Paper Title
Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning
Paper Authors
Paper Abstract
Multi-agent reinforcement learning has drawn increasing attention in practice, e.g., robotics and autonomous driving, as it can explore optimal policies using samples generated by interacting with the environment. However, high reward uncertainty remains a problem when we want to train a satisfactory model, because obtaining high-quality reward feedback is usually expensive or even infeasible. To handle this issue, previous methods mainly focus on passive reward correction, while recent active reward estimation methods have proven to be an effective recipe for reducing the effect of reward uncertainty. In this paper, we propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL). Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training. Specifically, we design the multi-action-branch reward estimation to model reward distributions on all action branches, and then utilize reward aggregation to obtain stable update signals during training. Our intuition is that considering all possible consequences of actions could be useful for learning policies. The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios against SOTA baselines in terms of both effectiveness and robustness.
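To make the two ingredients named in the abstract concrete, below is a minimal Python/PyTorch sketch of multi-action-branch reward estimation combined with policy-weighted reward aggregation. It assumes a discrete action space; the network architecture, the Gaussian (mean/log-std) parameterization of each branch, and all names such as MultiActionBranchRewardEstimator and policy_weighted_reward are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): estimate a reward distribution per
# action branch, then aggregate the branch means weighted by the current policy.
import torch
import torch.nn as nn


class MultiActionBranchRewardEstimator(nn.Module):
    """Predicts a reward distribution (mean, log-std) for every action branch."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, n_actions)     # one reward mean per action
        self.log_std_head = nn.Linear(hidden, n_actions)  # one reward log-std per action

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.mean_head(h), self.log_std_head(h)


def policy_weighted_reward(est: MultiActionBranchRewardEstimator,
                           obs: torch.Tensor,
                           action_probs: torch.Tensor) -> torch.Tensor:
    """Aggregate per-branch reward means, weighted by the policy's action probabilities."""
    mean, _ = est(obs)                     # (batch, n_actions)
    return (action_probs * mean).sum(-1)   # (batch,) aggregated training signal


if __name__ == "__main__":
    est = MultiActionBranchRewardEstimator(obs_dim=8, n_actions=5)
    obs = torch.randn(4, 8)
    probs = torch.softmax(torch.randn(4, 5), dim=-1)      # stand-in for the policy output
    print(policy_weighted_reward(est, obs, probs).shape)   # torch.Size([4])
```

The weighting by action probabilities reflects the stated intuition that all possible consequences of an action should contribute to the learning signal, rather than only the reward observed for the sampled action.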