Paper Title

Actor-Critic based Improper Reinforcement Learning

Paper Authors

Mohammadi Zaki, Avinash Mohan, Aditya Gopalan, Shie Mannor

Paper Abstract

We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing a cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
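
To make the class of policies in the abstract concrete: an "improper mixture of the given controllers" can be read as a learned convex combination over the $M$ base controllers, with the mixture weights as the parameters that the policy-gradient or actor-critic updates adjust. The sketch below is a minimal, hypothetical illustration and not the authors' code; the softmax parameterization over the simplex and all names (`improper_mixture_policy`, `base_controllers`, `theta`) are assumptions made for this example.

```python
import numpy as np

def improper_mixture_policy(state, base_controllers, theta, rng=None):
    """Sample an action by softmax-mixing M base controllers.

    theta: unnormalized mixture logits, one per base controller; these are
    the parameters a policy-gradient / actor-critic method would update.
    """
    rng = rng or np.random.default_rng()
    logits = theta - np.max(theta)                    # shift for numerical stability
    weights = np.exp(logits) / np.sum(np.exp(logits)) # mixture weights on the simplex
    k = rng.choice(len(base_controllers), p=weights)  # pick a base controller
    return base_controllers[k](state)                 # act with its proposed action

# Hypothetical usage with two trivial base controllers (e.g. "push left"/"push right"):
if __name__ == "__main__":
    controllers = [lambda s: 0, lambda s: 1]
    theta = np.zeros(len(controllers))                # start from the uniform mixture
    action = improper_mixture_policy(None, controllers, theta)
    print(action)
```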
