Paper Title
Learning to Model Opponent Learning
Paper Authors
Paper Abstract
Multi-Agent Reinforcement Learning (MARL) considers settings in which a set of coexisting agents interact with one another and their environment. The adaptation and learning of other agents induces non-stationarity in the environment dynamics. This poses a great challenge for value function-based algorithms whose convergence usually relies on the assumption of a stationary environment. Policy search algorithms also struggle in multi-agent settings as the partial observability resulting from an opponent's actions not being known introduces high variance to policy training. Modelling an agent's opponent(s) is often pursued as a means of resolving the issues arising from the coexistence of learning opponents. An opponent model provides an agent with some ability to reason about other agents to aid its own decision making. Most prior works learn an opponent model by assuming the opponent is employing a stationary policy or switching between a set of stationary policies. Such an approach can reduce the variance of training signals for policy search algorithms. However, in the multi-agent setting, agents have an incentive to continually adapt and learn. This means that the assumptions concerning opponent stationarity are unrealistic. In this work, we develop a novel approach to modelling an opponent's learning dynamics which we term Learning to Model Opponent Learning (LeMOL). We show our structured opponent model is more accurate and stable than naive behaviour cloning baselines. We further show that opponent modelling can improve the performance of algorithmic agents in multi-agent settings.
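The abstract contrasts opponent models that assume a stationary opponent policy with models that account for the opponent's ongoing learning. The following sketch is a minimal, hypothetical illustration of that distinction and is not the LeMOL architecture described in the paper: it compares a stationary behaviour-cloning estimate of a drifting opponent policy with a simple recency-weighted estimate that tracks the drift. All function and variable names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration (not the authors' LeMOL model): an opponent whose
# policy drifts over time as it "learns", modelled by (a) a stationary
# behaviour-cloning estimate and (b) a recency-weighted estimate that tracks
# the opponent's learning.

rng = np.random.default_rng(0)
n_actions = 3
T = 2000

def drifting_opponent_policy(t, total_steps):
    """Opponent gradually shifts probability mass from action 0 to action 2."""
    w = t / total_steps
    probs = np.array([1.0 - w, 0.3, w]) + 1e-3
    return probs / probs.sum()

# (a) Stationary behaviour-cloning model: smoothed empirical action counts.
bc_counts = np.ones(n_actions)
# (b) Drift-aware model: exponential moving average over recent opponent actions.
ema_probs = np.ones(n_actions) / n_actions
alpha = 0.02  # EMA step size

bc_log_loss, ema_log_loss = 0.0, 0.0
for t in range(T):
    true_probs = drifting_opponent_policy(t, T)
    a = rng.choice(n_actions, p=true_probs)

    # Score each model's prediction of the opponent's action before updating.
    bc_probs = bc_counts / bc_counts.sum()
    bc_log_loss += -np.log(bc_probs[a])
    ema_log_loss += -np.log(ema_probs[a])

    # Update both models on the observed opponent action.
    bc_counts[a] += 1.0
    ema_probs = (1 - alpha) * ema_probs + alpha * np.eye(n_actions)[a]

print(f"stationary behaviour cloning, log-loss per step: {bc_log_loss / T:.3f}")
print(f"drift-aware recency weighting, log-loss per step: {ema_log_loss / T:.3f}")
```

Under this toy drift, the recency-weighted model attains a lower predictive log-loss than the stationary behaviour-cloning estimate, which mirrors the abstract's claim that assuming opponent stationarity is unrealistic when opponents continually adapt.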