Paper Title

Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search

Authors

Li-Cheng Lan, Meng-Yu Tsai, Ti-Rong Wu, I-Chen Wu, Cho-Jui Hsieh

Abstract

Monte Carlo tree search (MCTS) has achieved state-of-the-art results in many domains, such as Go and Atari games, when combined with deep neural networks (DNNs). When more simulations are executed, MCTS can achieve higher performance, but it also requires enormous amounts of CPU and GPU resources. However, not all states require a long search time to identify the best action the agent can find. For example, in 19x19 Go and NoGo, we found that for more than half of the states, the best action predicted by the DNN remains unchanged even after searching for 2 minutes. This implies that a significant amount of resources can be saved if we can stop the search earlier whenever we are confident in the current search result. In this paper, we propose to achieve this goal by predicting the uncertainty of the current search status and using the result to decide whether we should stop searching. With our algorithm, called Dynamic Simulation MCTS (DS-MCTS), we can speed up a NoGo agent trained by AlphaZero by 2.5 times while maintaining a similar winning rate. Also, under the same average simulation count, our method achieves a 61% winning rate against the original program.
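The abstract only describes the stopping mechanism at a high level. Below is a minimal, self-contained Python sketch of the idea, assuming the search is periodically interrupted to compare a predicted uncertainty score against a threshold. All names here (`run_one_simulation`, `predicted_uncertainty`, the visit-gap heuristic) are illustrative stand-ins, not the paper's implementation, which uses a trained network to estimate uncertainty from the search status.

```python
import random

# Illustrative sketch of dynamic-simulation stopping (not the authors' code).
# A random, biased rollout stands in for a real MCTS simulation, and a simple
# visit-gap heuristic stands in for the learned uncertainty predictor.

class RootStats:
    """Visit counts and value sums for each candidate action at the root."""
    def __init__(self, num_actions):
        self.visits = [0] * num_actions
        self.values = [0.0] * num_actions

def run_one_simulation(stats):
    # Placeholder for one MCTS simulation (select/expand/evaluate/backup).
    # Biased toward action 0 so the early-stop check has something to detect.
    a = 0 if random.random() < 0.8 else random.randrange(len(stats.visits))
    stats.visits[a] += 1
    stats.values[a] += random.random()

def predicted_uncertainty(stats):
    # Placeholder for the learned predictor: treat the search as "certain"
    # when one action dominates the visit distribution at the root.
    total = sum(stats.visits)
    if total == 0:
        return 1.0
    return 1.0 - max(stats.visits) / total

def ds_mcts(num_actions=10, max_simulations=800,
            check_interval=50, threshold=0.3):
    stats = RootStats(num_actions)
    for sim in range(1, max_simulations + 1):
        run_one_simulation(stats)
        # Periodically ask the predictor whether we can stop early.
        if sim % check_interval == 0 and predicted_uncertainty(stats) < threshold:
            break
    best = max(range(num_actions), key=lambda a: stats.visits[a])
    return best, sim

if __name__ == "__main__":
    action, sims_used = ds_mcts()
    print(f"chose action {action} after {sims_used} simulations")
```

With the biased placeholder simulation, the check at simulation 50 usually finds one dominant action and stops well before the 800-simulation budget, which is the resource saving the paper targets.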
