Paper title
Policy Manifold Search for Improving Diversity-based Neuroevolution
Paper authors
Paper abstract
Diversity-based approaches have recently gained popularity as an alternative paradigm to performance-based policy search. A popular approach from this family, Quality-Diversity (QD), maintains a collection of high-performing policies separated in the diversity-metric space, which is defined based on the policies' rollout behaviours. When policies are parameterised as neural networks, i.e. Neuroevolution, QD tends not to scale well with the dimensionality of the parameter space. Our hypothesis is that there exists a low-dimensional manifold embedded in the policy parameter space that contains a high density of diverse and feasible policies. We propose a novel approach to diversity-based policy search via Neuroevolution that leverages learned latent representations of the policy parameters, which capture the local structure of the data. Our approach iteratively collects policies according to the QD framework in order to (i) build a collection of diverse policies, (ii) use it to learn a latent representation of the policy parameters, and (iii) perform policy search in the learned latent space. We use the Jacobian of the inverse transformation (i.e. the reconstruction function) to guide the search in the latent space. This ensures that the generated samples remain in the high-density regions of the original space after reconstruction. We evaluate our contributions on three continuous control tasks in simulated environments and compare them to diversity-based baselines. The findings suggest that our approach yields a more efficient and robust policy search process.
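The abstract does not say exactly how the Jacobian of the reconstruction function enters the latent-space search. The sketch below shows one plausible reading, offered as an assumption rather than the authors' method: a latent mutation is drawn with covariance sigma^2 (J^T J)^{-1}, so that to first order the step measured in the original parameter space has a controlled size. The toy `decoder`, the finite-difference Jacobian, and the names `jacobian_scaled_mutation`, `sigma` and `ridge` are all hypothetical and chosen for illustration only.

```python
import numpy as np


def decoder(z, W, b):
    """Toy reconstruction function: latent z -> policy parameters (hypothetical)."""
    return np.tanh(W @ z + b)


def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f at z, shape (dim_out, dim_latent)."""
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)


def jacobian_scaled_mutation(f, z, sigma, rng, ridge=1e-8):
    """Mutate z with a Gaussian step of covariance sigma^2 * (J^T J)^{-1}.

    Assumption for illustration: directions in which the decoder is very
    sensitive get smaller latent steps, so the reconstructed parameters
    move by roughly sigma per latent dimension (first-order approximation).
    """
    z = np.asarray(z, dtype=float)
    J = numerical_jacobian(f, z)
    G = J.T @ J + ridge * np.eye(z.size)   # small ridge keeps G positive definite
    L = np.linalg.cholesky(G)              # G = L L^T
    xi = rng.standard_normal(z.size)
    dz = np.linalg.solve(L.T, sigma * xi)  # cov(dz) = sigma^2 * G^{-1}
    return z + dz


# Usage: mutate a latent code and decode it back to policy parameters.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 2))  # 2-D latent space, 8 policy parameters (toy sizes)
b = np.zeros(8)
reconstruct = lambda z: decoder(z, W, b)

z_parent = np.array([0.1, -0.2])
z_child = jacobian_scaled_mutation(reconstruct, z_parent, sigma=0.05, rng=rng)
theta_child = reconstruct(z_child)  # candidate policy parameters to evaluate
```

In a full QD loop, `theta_child` would be rolled out in the environment and inserted into the collection according to its behaviour descriptor and performance. That step is omitted here because the abstract gives no details about it.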