Paper Title
On the Convergence Theory of Meta Reinforcement Learning with Personalized Policies
Paper Authors
Paper Abstract
Modern meta-reinforcement learning (Meta-RL) methods are mainly developed based on model-agnostic meta-learning, which performs policy gradient steps across tasks to maximize policy performance. However, the gradient conflict problem is still poorly understood in Meta-RL, and it may lead to performance degradation when encountering distinct tasks. To tackle this challenge, this paper proposes a novel personalized Meta-RL (pMeta-RL) algorithm, which aggregates task-specific personalized policies to update a meta-policy used for all tasks, while maintaining personalized policies to maximize the average return of each task under the constraint of the meta-policy. We also provide a theoretical analysis under the tabular setting, which demonstrates the convergence of our pMeta-RL algorithm. Moreover, we extend the proposed pMeta-RL algorithm to a deep network version based on soft actor-critic, making it suitable for continuous control tasks. Experimental results show that the proposed algorithms outperform previous Meta-RL algorithms on the Gym and MuJoCo suites.
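To make the described structure concrete, below is a minimal tabular sketch of one plausible reading of the abstract: each task maintains a personalized Q-table whose updates are pulled toward a shared meta Q-table (standing in for "the constraint of the meta-policy"), and the meta-table is periodically updated by aggregating the personalized tables. All names and hyperparameters (`lambda_reg`, `local_steps`, etc.) are illustrative assumptions, not the paper's actual algorithm or notation.

```python
import numpy as np

def pmeta_rl_sketch(envs, n_states, n_actions,
                    n_rounds=100, local_steps=50,
                    alpha=0.1, gamma=0.99, lambda_reg=1.0, eps=0.1):
    """Hypothetical tabular sketch of a personalized Meta-RL loop:
    per-task personalized Q-tables regularized toward a meta Q-table,
    followed by aggregation of the personalized tables into the meta-table.
    Assumes Gym-style environments with discrete integer states."""
    rng = np.random.default_rng(0)
    meta_q = np.zeros((n_states, n_actions))
    personal_q = [meta_q.copy() for _ in envs]

    for _ in range(n_rounds):
        # personalization step: each task improves its own policy while
        # being pulled toward the current meta-policy (proximal term)
        for i, env in enumerate(envs):
            q = personal_q[i]
            s, _ = env.reset()
            for _ in range(local_steps):
                # epsilon-greedy action from the personalized policy
                if rng.random() < eps:
                    a = int(rng.integers(n_actions))
                else:
                    a = int(np.argmax(q[s]))
                s_next, r, terminated, truncated, _ = env.step(a)
                # TD update plus a proximal pull toward the meta Q-table
                td_target = r + gamma * np.max(q[s_next]) * (not terminated)
                q[s, a] += alpha * (td_target - q[s, a]) \
                           - alpha * lambda_reg * (q[s, a] - meta_q[s, a])
                s = s_next
                if terminated or truncated:
                    s, _ = env.reset()
        # aggregation step: update the meta-policy from the personalized ones
        meta_q = np.mean(personal_q, axis=0)
    return meta_q, personal_q
```

This sketch only illustrates the aggregate-then-personalize interplay stated in the abstract; the paper's deep version replaces the Q-tables with soft actor-critic networks for continuous control.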