Paper Title

POMO: Policy Optimization with Multiple Optima for Reinforcement Learning

Paper Authors

Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Youngjune Gwon, Seungjai Min

Paper Abstract

In neural combinatorial optimization (CO), reinforcement learning (RL) can turn a deep neural net into a fast, powerful heuristic solver of NP-hard problems. This approach has great potential in practical applications because it allows near-optimal solutions to be found without the guidance of experts armed with substantial domain knowledge. We introduce Policy Optimization with Multiple Optima (POMO), an end-to-end approach for building such a heuristic solver. POMO is applicable to a wide range of CO problems. It is designed to exploit the symmetries in the representation of a CO solution. POMO uses a modified REINFORCE algorithm that forces diverse rollouts towards all optimal solutions. Empirically, the low-variance baseline of POMO makes RL training fast and stable, and it is more resistant to local minima compared to previous approaches. We also introduce a new augmentation-based inference method that pairs well with POMO. We demonstrate the effectiveness of POMO by solving three popular NP-hard problems, namely, traveling salesman (TSP), capacitated vehicle routing (CVRP), and 0-1 knapsack (KP). For all three, our solver based on POMO shows a significant improvement in performance over all recent learned heuristics. In particular, we achieve an optimality gap of 0.14% on TSP100 while reducing inference time by more than an order of magnitude.
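The modified REINFORCE update described in the abstract can be made concrete with a short sketch. Assuming a policy network that produces N rollouts per instance (one rollout per designated starting node), POMO's baseline is simply the mean reward of those N rollouts; the names `pomo_loss`, `log_probs`, and `rewards` below are illustrative, not taken from the authors' code.

```python
import torch

def pomo_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """POMO-style policy-gradient loss with a shared baseline.

    log_probs: (batch, N) summed log-probability of each of the N rollouts.
    rewards:   (batch, N) total reward per rollout, e.g. negative tour length.
    """
    # Shared baseline: the mean reward over the N diverse rollouts of the
    # same instance. Subtracting it yields the low-variance advantage signal
    # that the paper credits for fast, stable training.
    baseline = rewards.mean(dim=1, keepdim=True)   # (batch, 1)
    advantage = rewards - baseline                 # (batch, N)
    # REINFORCE ascends the advantage-weighted log-likelihood, so the loss
    # to minimize is its negation.
    return -(advantage * log_probs).mean()

# Toy usage: a batch of 32 TSP100 instances with N = 100 rollouts each.
log_probs = torch.randn(32, 100, requires_grad=True)
rewards = -torch.rand(32, 100)  # negative tour lengths
pomo_loss(log_probs, rewards).backward()
```

The augmentation-based inference method can be sketched in the same spirit. For Euclidean problems whose coordinates lie in the unit square, the paper's ×8 instance augmentation applies the eight symmetries of the square (the dihedral group D4): the solver is run on every transformed copy and the best solution per original instance is kept. The function name below is again hypothetical.

```python
import torch

def augment_xy_times_8(coords: torch.Tensor) -> torch.Tensor:
    """Return the 8 symmetric variants of (batch, n, 2) coordinates in [0, 1]^2.

    Output shape: (8 * batch, n, 2). At test time the solver is run on every
    variant, and the shortest tour found per original instance is kept.
    """
    x, y = coords[..., 0:1], coords[..., 1:2]
    pairs = [(x, y), (1 - x, y), (x, 1 - y), (1 - x, 1 - y),
             (y, x), (1 - y, x), (y, 1 - x), (1 - y, 1 - x)]
    return torch.cat([torch.cat(p, dim=-1) for p in pairs], dim=0)
```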
