Paper Title

Double Meta-Learning for Data Efficient Policy Optimization in Non-Stationary Environments

Paper Authors

Elahe Aghapour, Nora Ayanian

Paper Abstract

We are interested in learning models of non-stationary environments, which can be framed as a multi-task learning problem. Model-free reinforcement learning algorithms can achieve good asymptotic performance in multi-task learning, but at the cost of extensive sampling, since they learn each task from scratch. While model-based approaches are among the most data-efficient learning algorithms, they still struggle with complex tasks and model uncertainties. Meta-reinforcement learning addresses the efficiency and generalization challenges of multi-task learning by quickly leveraging a meta-prior policy for a new task. In this paper, we propose a meta-reinforcement learning approach that learns the dynamics model of a non-stationary environment, to be used later for meta-policy optimization. Owing to the sample efficiency of model-based learning methods, we are able to train the meta-model of the non-stationary environment and the meta-policy simultaneously until the dynamics model converges. The meta-learned dynamics model of the environment then generates simulated data for meta-policy optimization. Our experiments demonstrate that the proposed method can meta-learn a policy in a non-stationary environment with the data efficiency of model-based learning approaches while achieving the high asymptotic performance of model-free meta-reinforcement learning.
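To make the two-phase procedure described in the abstract concrete, below is a minimal, first-order MAML-style sketch in Python (NumPy only). It is an illustration under assumptions, not the paper's implementation: the environment modes are hypothetical linear dynamics, the learned dynamics model is linear, the policy update is left as a placeholder, and all names (`sample_task`, `rollout`, `model_loss_grad`, `simulate`) are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACT_DIM = 3, 2
INNER_LR, META_LR = 0.01, 0.001

def sample_task():
    """One 'task' = one mode of the non-stationary environment.
    Hypothetical linear dynamics: s' = A s + B a + noise."""
    A = np.eye(STATE_DIM) + 0.1 * rng.standard_normal((STATE_DIM, STATE_DIM))
    B = rng.standard_normal((STATE_DIM, ACT_DIM))
    return A, B

def rollout(task, policy_W, horizon=20):
    """Collect (s, a, s') transitions from the 'real' environment."""
    A, B = task
    s = rng.standard_normal(STATE_DIM)
    data = []
    for _ in range(horizon):
        a = policy_W @ s + 0.1 * rng.standard_normal(ACT_DIM)  # exploratory action
        s_next = A @ s + B @ a + 0.01 * rng.standard_normal(STATE_DIM)
        data.append((s, a, s_next))
        s = s_next
    return data

def model_loss_grad(theta, data):
    """MSE loss and gradient for a linear dynamics model s' ~ theta @ [s; a]."""
    X = np.stack([np.concatenate([s, a]) for s, a, _ in data])  # (N, S+A)
    Y = np.stack([s_next for _, _, s_next in data])             # (N, S)
    err = X @ theta.T - Y
    return np.mean(err**2), 2.0 * err.T @ X / err.size

# Phase 1: meta-train the dynamics model with an inner/outer (MAML-style)
# loop on real rollouts; in the paper the meta-policy is trained alongside,
# but here the policy is kept as a fixed placeholder.
theta = 0.1 * rng.standard_normal((STATE_DIM, STATE_DIM + ACT_DIM))
policy_W = np.zeros((ACT_DIM, STATE_DIM))

for meta_iter in range(200):
    meta_grad = np.zeros_like(theta)
    for _ in range(4):                        # batch of tasks / environment modes
        task = sample_task()
        support, query = rollout(task, policy_W), rollout(task, policy_W)
        _, g = model_loss_grad(theta, support)
        theta_task = theta - INNER_LR * g     # inner step: adapt to this task
        _, g_query = model_loss_grad(theta_task, query)
        meta_grad += g_query                  # first-order meta-gradient
    theta -= META_LR * meta_grad / 4

# Phase 2: the converged meta-model generates simulated transitions that
# stand in for real data during meta-policy optimization.
def simulate(theta_model, policy_W, horizon=20):
    s = rng.standard_normal(STATE_DIM)
    sim = []
    for _ in range(horizon):
        a = policy_W @ s
        s = theta_model @ np.concatenate([s, a])  # model-predicted next state
        sim.append((s, a))
    return sim

sim_data = simulate(theta, policy_W)
print(f"meta-model trained; generated {len(sim_data)} simulated steps for policy optimization")
```

The structure mirrors the abstract: Phase 1 trains the meta-model on real rollouts until convergence, and Phase 2 replaces real interaction with model-generated transitions for meta-policy optimization, which is where the data-efficiency gain comes from.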
