Paper Title
Online Control of Unknown Time-Varying Dynamical Systems
Paper Authors
Paper Abstract
We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is \emph{qualitatively harder} than that of either unknown time-invariant or known time-varying dynamics, and we complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More specifically, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems. We prove a lower bound showing that no algorithm can obtain sublinear regret with respect to the first two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that offline planning over state linear feedback policies is NP-hard, suggesting hardness of the online learning problem. On the positive side, we give an efficient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear \emph{adaptive} regret bounds, a strictly stronger metric than standard regret that is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefficient algorithm for regret against linear state feedback policies.
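For readers unfamiliar with the terminology, the following is a minimal sketch of the standard nonstochastic control formulation that the abstract's terms refer to. The notation ($x_t$, $u_t$, $w_t$, $A_t$, $B_t$, $c_t$, the memory length $m$, and the comparator class $\Pi$) follows common conventions in this literature and is an illustrative assumption, not necessarily the paper's exact definitions.

% Sketch of the standard setup; notation assumed, not taken from the paper.
% Time-varying linear dynamics with unknown (A_t, B_t) and adversarial w_t:
\[
  x_{t+1} = A_t x_t + B_t u_t + w_t .
\]
% A Disturbance Response (Youla) policy of memory m acts linearly on past disturbances:
\[
  u_t = \sum_{i=1}^{m} M^{[i]} w_{t-i},
  \qquad \text{parameterized by matrices } M^{[1]}, \dots, M^{[m]} .
\]
% Standard regret compares the learner's cumulative cost to the best fixed policy in a class Pi:
\[
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t\!\left(x_t^{\pi}, u_t^{\pi}\right),
\]
% while adaptive regret strengthens this by taking the worst case over all contiguous intervals,
% so the learner must compete with a possibly different comparator on every subwindow:
\[
  \mathrm{AdaptiveRegret}_T \;=\; \max_{[r,s] \subseteq [T]} \left[ \sum_{t=r}^{s} c_t(x_t, u_t) \;-\; \min_{\pi \in \Pi} \sum_{t=r}^{s} c_t\!\left(x_t^{\pi}, u_t^{\pi}\right) \right].
\]

Because adaptive regret controls the regret on every interval rather than only the full horizon, it is the more natural benchmark when the dynamics $(A_t, B_t)$ themselves drift over time, which is why the abstract emphasizes it for time-varying systems.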