Paper Title
Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search
Paper Authors
Paper Abstract
Decision-making under uncertainty (DMU) is present in many important problems. An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time. Reinforcement Learning (RL), a popular approach for DMU problems, learns a policy by interacting with a model of the environment offline. Unfortunately, if the environment changes, the policy can become stale and take sub-optimal actions, and relearning the policy for the updated environment takes time and computational effort. An alternative is online planning approaches such as Monte Carlo Tree Search (MCTS), which perform their computation at decision time. Given the current environment, MCTS plans using high-fidelity models to determine promising action trajectories. These models can be updated as soon as environmental changes are detected and immediately incorporated into decision making. However, MCTS's convergence can be slow for domains with large state-action spaces. In this paper, we present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses. Our approach, called Policy-Augmented MCTS (PA-MCTS), integrates a policy's action-value estimates into MCTS, using the estimates to seed the action trajectories favored by the search. We hypothesize that PA-MCTS will converge more quickly than standard MCTS while making better decisions than the policy can make on its own when faced with non-stationary environments. We test our hypothesis by comparing PA-MCTS with pure MCTS and an RL agent applied to the classical CartPole environment. We find that PA-MCTS can achieve higher cumulative rewards than the policy in isolation under several environmental shifts while converging in significantly fewer iterations than pure MCTS.
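To make the idea concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of how a policy's action-value estimates could be blended into MCTS action selection. The `Node` structure, the `q_policy` callable, and the weighting parameter `alpha` are illustrative assumptions, not definitions from the paper.

```python
# Hypothetical sketch: biasing MCTS action selection with an offline
# policy's action-value estimates (names and structure are illustrative).
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    state: object
    visits: int = 0
    total_reward: float = 0.0
    children: dict = field(default_factory=dict)  # maps action -> Node


def pa_mcts_select(node: Node, q_policy, alpha: float = 0.5, c: float = 1.4):
    """Pick an action by combining the policy's Q estimate with UCT.

    q_policy(state, action) -> float is an assumed wrapper around the offline
    RL policy's action-value output; alpha controls how much the (possibly
    stale) policy steers the search relative to the tree's own statistics.
    """
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # expand unvisited actions first
        uct = child.total_reward / child.visits + c * math.sqrt(
            math.log(node.visits) / child.visits
        )
        # Convex combination: the policy's estimate seeds promising
        # trajectories early, while accumulated simulation evidence
        # dominates as visit counts grow.
        score = alpha * q_policy(node.state, action) + (1 - alpha) * uct
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

One natural design choice under these assumptions is to keep `alpha` modest (or decay it with visit count) so that, when the environment has shifted and the policy's estimates are stale, the up-to-date simulation model can still override the policy's bias within relatively few iterations.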