Paper Title
Cooperative-Competitive Reinforcement Learning with History-Dependent Rewards
Paper Authors
Paper Abstract
Consider a typical organization whose worker agents seek to collectively cooperate for its general betterment. However, each individual agent simultaneously seeks to secure a larger share of the annual increment in compensation than its co-workers, an increment that usually comes from a {\em fixed} pot. As such, the individual agent in the organization must both cooperate and compete. Another feature of many organizations is that a worker receives a bonus, which is often a fraction of the previous year's total profit. As such, the agent derives a reward that also depends partly on historical performance. How should the individual agent decide to act in this context? The few methods for the mixed cooperative-competitive setting presented in recent years are challenged by problem domains whose reward functions do not depend only on the current state and action. Recent deep multi-agent reinforcement learning (MARL) methods using long short-term memory (LSTM) may be used, but these adopt a joint perspective on the interaction or require an explicit exchange of information among the agents to promote cooperation, which may not be possible under competition. In this paper, we first show that the agent's decision-making problem can be modeled as an interactive partially observable Markov decision process (I-POMDP) that captures the dynamics of a history-dependent reward. We present an interactive advantage actor-critic method (IA2C$^+$), which combines the independent advantage actor-critic network with a belief filter that maintains a belief distribution over the other agents' models. Empirical results show that IA2C$^+$ learns the optimal policy faster and more robustly than several other baselines, including one that uses an LSTM, even when the attributed models are incorrect.
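To make the abstract's description of IA2C$^+$ concrete, the following is a minimal sketch of an independent advantage actor-critic agent whose input is augmented with a discrete Bayesian belief over candidate models of the other agent. It is an illustration under stated assumptions, not the authors' implementation: the class names, network sizes, and the likelihood interface of the belief filter are all hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

# Illustrative sketch: an independent advantage actor-critic whose observation
# is augmented with a belief over candidate models of the other agent, in the
# spirit of the IA2C+ description above. All names and sizes are assumptions.

class ActorCritic(nn.Module):
    def __init__(self, input_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)  # policy head (action logits)
        self.v = nn.Linear(hidden, 1)           # value head (state value)

    def forward(self, x):
        h = self.body(x)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h)

class BeliefFilter:
    """Bayesian belief over a finite set of candidate models of the other agent."""
    def __init__(self, n_models):
        self.belief = np.full(n_models, 1.0 / n_models)

    def update(self, obs_likelihoods):
        # obs_likelihoods[j] = P(private observation | other agent uses model j)
        posterior = self.belief * np.asarray(obs_likelihoods, dtype=np.float64)
        total = posterior.sum()
        if total > 0.0:
            self.belief = posterior / total
        return self.belief

# Wiring the two together for a single decision step (placeholder inputs).
obs_dim, n_actions, n_models = 8, 4, 3
agent = ActorCritic(obs_dim + n_models, n_actions)
belief_filter = BeliefFilter(n_models)

obs = np.zeros(obs_dim, dtype=np.float32)       # placeholder private observation
belief = belief_filter.update([0.2, 0.5, 0.3])  # placeholder model likelihoods
aug_obs = torch.as_tensor(np.concatenate([obs, belief]), dtype=torch.float32)

dist, value = agent(aug_obs)
action = dist.sample()                          # action sampled from the policy head
```

The design choice illustrated here is that each agent keeps its own network and belief filter, so no information is exchanged with the other agents; the belief is updated only from the agent's private observations, which is what allows cooperation-aware behavior without explicit communication under competition.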