Paper Title
Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization
Paper Authors
Paper Abstract
We present Reward-Switching Policy Optimization (RSPO), a paradigm to discover diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measurement during the optimization process. When a sampled trajectory is sufficiently distinct, RSPO performs standard policy optimization with extrinsic rewards. For trajectories with high likelihood under existing policies, RSPO utilizes an intrinsic diversity reward to promote exploration. Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent particle-world tasks and MuJoCo continuous control to multi-agent stag-hunt games and StarCraftII challenges.
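The per-trajectory reward-switching rule described above can be sketched roughly as follows. This is a minimal, hypothetical illustration only; the function name `rspo_reward`, the parameter `novelty_threshold`, and the negative-log-likelihood novelty measure are assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def rspo_reward(trajectory, extrinsic_returns, intrinsic_returns,
                existing_policies, novelty_threshold):
    """Hypothetical sketch of RSPO's trajectory-based reward switching.

    If the sampled trajectory is sufficiently novel (unlikely under all
    previously discovered policies), optimize the extrinsic task reward;
    otherwise, substitute an intrinsic diversity reward to push the
    learner away from already-known local optima.
    """
    if existing_policies:
        # Assumed novelty measure: the lowest average negative
        # log-likelihood of the trajectory's actions under any
        # previously discovered policy.
        novelty = min(
            -np.mean([policy.log_prob(s, a) for s, a in trajectory])
            for policy in existing_policies
        )
    else:
        # No existing policies yet: every trajectory counts as novel.
        novelty = float("inf")

    if novelty >= novelty_threshold:
        # Sufficiently distinct: standard policy optimization on the
        # extrinsic reward.
        return extrinsic_returns
    # Too similar to an existing policy: use the intrinsic diversity
    # reward to promote exploration toward undiscovered behaviors.
    return intrinsic_returns
```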