Paper Title
Model-Free Characterizations of the Hamilton-Jacobi-Bellman Equation and Convex Q-Learning in Continuous Time
Paper Authors
Paper Abstract
Convex Q-learning is a recent approach to reinforcement learning, motivated by the possibility of a firmer convergence theory and the possibility of making use of greater a priori knowledge regarding policy or value function structure. This paper explores algorithm design in the continuous-time domain, with a finite-horizon optimal control objective. The main contributions are: (i) algorithm design is based on a new Q-ODE, which defines a model-free characterization of the Hamilton-Jacobi-Bellman equation; (ii) the Q-ODE motivates a new formulation of Convex Q-learning that avoids the approximations appearing in prior work, and the Bellman error used in the algorithm is defined through filtered measurements, which is beneficial in the presence of measurement noise; (iii) a characterization of boundedness of the constraint region is obtained through a non-trivial extension of recent results from the discrete-time setting; (iv) the theory is illustrated in an application to resource allocation for distributed energy resources, for which it is ideally suited.
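For background, the following is a minimal LaTeX sketch of the standard finite-horizon Hamilton-Jacobi-Bellman equation and one common continuous-time Q-function convention, together with the trajectory identity that makes a model-free characterization possible. The notation (dynamics f, cost c, terminal cost V_0, value function J*, Q-function Q*) is generic textbook background assumed here for illustration; the paper's precise Q-ODE may differ in detail.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Finite-horizon optimal control: state dynamics and cost-to-go
% (standard setup; notation is illustrative, not taken from the paper)
\[
  \dot{x}(t) = f(x(t), u(t)), \qquad
  J^\star(x, t) = \min_{u(\cdot)} \int_t^T c(x(s), u(s))\, ds + V_0(x(T)).
\]
% Hamilton-Jacobi-Bellman equation for the value function
\[
  -\partial_t J^\star(x,t) = \min_{u}\bigl\{ c(x,u) + \nabla_x J^\star(x,t)\cdot f(x,u) \bigr\},
  \qquad J^\star(x,T) = V_0(x).
\]
% One continuous-time Q-function convention used in prior convex
% Q-learning work; the HJB equation is then equivalent to min_u Q* = 0:
\[
  Q^\star(x,u,t) = c(x,u) + \partial_t J^\star(x,t) + \nabla_x J^\star(x,t)\cdot f(x,u),
  \qquad \min_u Q^\star(x,u,t) = 0.
\]
% Trajectory identity: by the chain rule, along any input-state
% trajectory (x(t), u(t)),
\[
  \frac{d}{dt} J^\star(x(t),t) = -\,c(x(t),u(t)) + Q^\star(x(t),u(t),t),
\]
% which involves only observed signals and the cost, not the dynamics f.
% This is the flavor of "Q-ODE" characterization the abstract describes:
% a Bellman error can be formed from (filtered) measurements alone.
\end{document}

The key design point suggested by the identity above: because the dynamics f enter only through the observed trajectory, the Bellman error can be evaluated from measured data, which is what permits a convex, model-free formulation.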