Paper Title
Policy Gradient Method For Robust Reinforcement Learning
Paper Authors
Paper Abstract
This paper develops the first policy gradient method with a global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch. Robust reinforcement learning aims to learn a policy that is robust to the model mismatch between the simulator and the real environment. We first develop the robust policy (sub-)gradient, which is applicable to any differentiable parametric policy class. We show that the proposed robust policy gradient method converges to the global optimum asymptotically under direct policy parameterization. We further develop a smoothed robust policy gradient method and show that achieving an $\epsilon$-global optimum requires a complexity of $\mathcal O(\epsilon^{-3})$. We then extend our methodology to the general model-free setting and design a robust actor-critic method with a differentiable parametric policy class and value function. We further characterize its asymptotic convergence and sample complexity under the tabular setting. Finally, we provide simulation results to demonstrate the robustness of our methods.
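For illustration only, below is a minimal sketch (not taken from the paper) of a robust policy gradient loop under direct policy parameterization in a small tabular MDP. It assumes an R-contamination uncertainty set, in which the worst case mixes the nominal transition kernel with an arbitrary adversarial kernel with weight R; the paper's exact uncertainty set, gradient expression, and step-size schedule may differ, and all function names here are hypothetical.

```python
# Hypothetical sketch of robust policy gradient with direct parameterization.
# Assumes an R-contamination uncertainty set; not the paper's exact algorithm.
import numpy as np

def robust_eval(P, r, pi, gamma, R, tol=1e-8):
    """Worst-case value and Q-function of policy pi under R-contamination:
    with prob. (1 - R) the nominal kernel P is followed, with prob. R the
    adversary moves to the state with the smallest value."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    while True:
        PV = P @ V                                    # (nS, nA): sum_{s'} P(s'|s,a) V(s')
        Q = r + gamma * ((1 - R) * PV + R * V.min())  # robust Bellman backup
        V_new = (pi * Q).sum(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q
        V = V_new

def project_simplex(x):
    """Euclidean projection of each row of x onto the probability simplex."""
    u = -np.sort(-x, axis=1)
    css = np.cumsum(u, axis=1) - 1.0
    k = np.arange(1, x.shape[1] + 1)
    rho = (u - css / k > 0).sum(axis=1)
    theta = css[np.arange(x.shape[0]), rho - 1] / rho
    return np.maximum(x - theta[:, None], 0.0)

def robust_policy_gradient(P, r, rho0, gamma=0.9, R=0.1, lr=0.5, iters=500):
    """Projected (sub-)gradient ascent on the worst-case return."""
    nS, nA, _ = P.shape
    pi = np.full((nS, nA), 1.0 / nA)                  # start from the uniform policy
    for _ in range(iters):
        V, Q = robust_eval(P, r, pi, gamma, R)
        # Worst-case kernel attaining the inner minimum: with prob. (1 - R)
        # follow P, with prob. R jump to the state minimizing V.
        P_worst = (1 - R) * P
        P_worst[:, :, V.argmin()] += R
        # Discounted state-visitation distribution under pi and P_worst.
        P_pi = np.einsum('sa,sat->st', pi, P_worst)
        d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)
        # Policy-gradient-style ascent direction, then project onto the simplex.
        grad = d[:, None] * Q / (1 - gamma)
        pi = project_simplex(pi + lr * grad)
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nS, nA = 4, 2
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # random nominal kernel
    r = rng.uniform(size=(nS, nA))                    # random reward table
    pi = robust_policy_gradient(P, r, rho0=np.full(nS, 1.0 / nS))
    print(pi)
```

The smoothed robust policy gradient variant mentioned in the abstract would, roughly, replace the non-smooth `V.min()` term with a soft minimum (e.g. a temperature-scaled LogSumExp) so that the objective becomes differentiable; this is again a sketch of the idea rather than the paper's exact construction.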