Paper Title

Actor-Critic based Improper Reinforcement Learning

Paper Authors

Mohammadi Zaki, Avinash Mohan, Aditya Gopalan, Shie Mannor

Paper Abstract

We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing a cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
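
To make the class of policies in the abstract concrete: an "improper mixture of the given controllers" can be read as a learned convex combination over the $M$ base controllers, with the mixture weights as the parameters that the policy-gradient or actor-critic updates adjust. The sketch below is a minimal, hypothetical illustration and not the authors' code; the softmax parameterization over the simplex and all names (`improper_mixture_policy`, `base_controllers`, `theta`) are assumptions made for this example.

```python
import numpy as np

def improper_mixture_policy(state, base_controllers, theta, rng=None):
    """Sample an action by softmax-mixing M base controllers.

    theta: unnormalized mixture logits, one per base controller; these are
    the parameters a policy-gradient / actor-critic method would update.
    """
    rng = rng or np.random.default_rng()
    logits = theta - np.max(theta)                    # shift for numerical stability
    weights = np.exp(logits) / np.sum(np.exp(logits)) # mixture weights on the simplex
    k = rng.choice(len(base_controllers), p=weights)  # pick a base controller
    return base_controllers[k](state)                 # act with its proposed action

# Hypothetical usage with two trivial base controllers (e.g. "push left"/"push right"):
if __name__ == "__main__":
    controllers = [lambda s: 0, lambda s: 1]
    theta = np.zeros(len(controllers))                # start from the uniform mixture
    action = improper_mixture_policy(None, controllers, theta)
    print(action)
```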
