Paper Title
Dyna-T: Dyna-Q and Upper Confidence Bounds Applied to Trees
Paper Authors
Paper Abstract
In this work we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL), a planning agent maintains its own representation of the environment as a model. To discover an optimal policy for interacting with the environment, the agent collects experience in a trial-and-error fashion. Experience can be used either to learn a better model or to directly improve the value function and policy. While these two uses are typically kept separate, Dyna-Q is a hybrid approach that, at each iteration, exploits real experience to update both the model and the value function, while planning its actions using data simulated from that model. However, the planning process is computationally expensive and depends strongly on the dimensionality of the state-action space. We propose to build an Upper Confidence Tree (UCT) on the simulated experience and to search for the best action to select during the online learning process. We demonstrate the effectiveness of the proposed method in a set of preliminary tests on three testbed environments from OpenAI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in stochastic environments by adopting a more robust action-selection strategy.
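To make the planning loop described above concrete, below is a minimal Python sketch of the general idea: a tabular Dyna-Q agent that learns a model from real experience, replays simulated transitions from that model, and selects actions with a UCB-style exploration bonus rather than epsilon-greedy. This is only an illustrative approximation under stated assumptions; the toy ChainEnv, the hyper-parameters, and all helper names are invented for the example, and the actual Dyna-T method builds a full UCT over simulated rollouts rather than the single-step UCB rule shown here.

import math
import random
from collections import defaultdict

class ChainEnv:
    """Toy 5-state chain (an assumption, not one of the paper's OpenAI testbeds):
    action 1 moves right, action 0 resets; reward 1 on reaching the last state."""
    n_states, n_actions = 5, 2

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, self.n_states - 1) if a == 1 else 0
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

def ucb_action(Q, counts, s, n_actions, c=1.0):
    """Pick the action maximising Q plus a UCB1-style exploration bonus."""
    total = sum(counts[(s, a)] for a in range(n_actions)) + 1
    def score(a):
        n = counts[(s, a)]
        return float('inf') if n == 0 else Q[(s, a)] + c * math.sqrt(math.log(total) / n)
    return max(range(n_actions), key=score)

def dyna_q_with_ucb(episodes=200, alpha=0.1, gamma=0.95, planning_steps=20):
    env = ChainEnv()
    Q = defaultdict(float)     # state-action values
    counts = defaultdict(int)  # visit counts feeding the UCB bonus
    model = {}                 # learned deterministic model: (s, a) -> (r, s')
    acts = range(ChainEnv.n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = ucb_action(Q, counts, s, ChainEnv.n_actions)  # UCB-based selection
            s2, r, done = env.step(a)
            counts[(s, a)] += 1
            # Direct RL update from real experience.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in acts) - Q[(s, a)])
            # Model learning from real experience.
            model[(s, a)] = (r, s2)
            # Planning: replay simulated transitions drawn from the learned model.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in acts) - Q[(ps, pa)])
            s = s2
    return Q

if __name__ == "__main__":
    Q = dyna_q_with_ucb()
    print({k: round(v, 2) for k, v in sorted(Q.items())})

The design point the sketch tries to mirror is the one the abstract makes: the value function and the model are both updated from real experience at every step, planning reuses the model through simulated updates, and action selection relies on an optimism-in-the-face-of-uncertainty rule, which is what makes the behaviour more robust in stochastic environments than a purely greedy or epsilon-greedy Dyna-Q.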