Paper Title
On the Convergence Theory of Meta Reinforcement Learning with Personalized Policies
Paper Authors
Paper Abstract
Modern meta-reinforcement learning (Meta-RL) methods are mainly developed based on model-agnostic meta-learning, which performs policy gradient steps across tasks to maximize policy performance. However, the gradient conflict problem is still poorly understood in Meta-RL, and it may lead to performance degradation when encountering distinct tasks. To tackle this challenge, this paper proposes a novel personalized Meta-RL (pMeta-RL) algorithm, which aggregates task-specific personalized policies to update a meta-policy used for all tasks, while maintaining personalized policies to maximize the average return of each task under the constraint of the meta-policy. We also provide a theoretical analysis under the tabular setting, which demonstrates the convergence of our pMeta-RL algorithm. Moreover, we extend the proposed pMeta-RL algorithm to a deep network version based on soft actor-critic, making it suitable for continuous control tasks. Experimental results show that the proposed algorithms outperform previous Meta-RL algorithms on the Gym and MuJoCo suites.
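To make the described structure concrete, below is a minimal tabular sketch of one plausible reading of the abstract: each task maintains a personalized Q-table whose updates are pulled toward a shared meta Q-table (standing in for "the constraint of the meta-policy"), and the meta-table is periodically updated by aggregating the personalized tables. All names and hyperparameters (`lambda_reg`, `local_steps`, etc.) are illustrative assumptions, not the paper's actual algorithm or notation.

```python
import numpy as np

def pmeta_rl_sketch(envs, n_states, n_actions,
                    n_rounds=100, local_steps=50,
                    alpha=0.1, gamma=0.99, lambda_reg=1.0, eps=0.1):
    """Hypothetical tabular sketch of a personalized Meta-RL loop:
    per-task personalized Q-tables regularized toward a meta Q-table,
    followed by aggregation of the personalized tables into the meta-table.
    Assumes Gym-style environments with discrete integer states."""
    rng = np.random.default_rng(0)
    meta_q = np.zeros((n_states, n_actions))
    personal_q = [meta_q.copy() for _ in envs]

    for _ in range(n_rounds):
        # personalization step: each task improves its own policy while
        # being pulled toward the current meta-policy (proximal term)
        for i, env in enumerate(envs):
            q = personal_q[i]
            s, _ = env.reset()
            for _ in range(local_steps):
                # epsilon-greedy action from the personalized policy
                if rng.random() < eps:
                    a = int(rng.integers(n_actions))
                else:
                    a = int(np.argmax(q[s]))
                s_next, r, terminated, truncated, _ = env.step(a)
                # TD update plus a proximal pull toward the meta Q-table
                td_target = r + gamma * np.max(q[s_next]) * (not terminated)
                q[s, a] += alpha * (td_target - q[s, a]) \
                           - alpha * lambda_reg * (q[s, a] - meta_q[s, a])
                s = s_next
                if terminated or truncated:
                    s, _ = env.reset()
        # aggregation step: update the meta-policy from the personalized ones
        meta_q = np.mean(personal_q, axis=0)
    return meta_q, personal_q
```

This sketch only illustrates the aggregate-then-personalize interplay stated in the abstract; the paper's deep version replaces the Q-tables with soft actor-critic networks for continuous control.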