Paper Title
Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction
Paper Authors
Paper Abstract
This paper presents a model-free reinforcement learning (RL) algorithm to synthesize a control policy that maximizes the satisfaction probability of linear temporal logic (LTL) specifications. To account for environment and motion uncertainties, the robot motion is modeled as a probabilistic labeled Markov decision process with unknown transition probabilities and an unknown probabilistic label function. The LTL task specification is converted to a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets, which maintains dense rewards during learning. The novelty of applying the LDGBA lies in constructing an embedded LDGBA (E-LDGBA) via a synchronous tracking-frontier function, which records unvisited accepting sets without increasing dimensionality or computational complexity. With appropriately dependent reward and discount functions, rigorous analysis shows that any method optimizing the expected discounted return of the RL-based approach is guaranteed to find the optimal policy that maximizes the satisfaction probability of the LTL specifications. A model-free RL-based motion planning strategy is then developed to generate this optimal policy. The effectiveness of the RL-based control synthesis is demonstrated via simulation and experimental results.
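The abstract describes a synchronous tracking-frontier function that records which LDGBA accepting sets remain unvisited, and a frontier-based reward that keeps the learning signal dense. Below is a minimal Python sketch of that idea, offered only as an illustration under stated assumptions; the function names (update_frontier, frontier_reward) and the reward value r_accept are hypothetical and not the authors' implementation.

def update_frontier(q, frontier, accepting_sets):
    # Remove every accepting set that contains the current automaton state q.
    # Once all sets have been visited, reset the frontier so the generalized
    # Buchi condition (visit every accepting set infinitely often) keeps
    # producing informative, dense rewards, as motivated in the abstract.
    new_frontier = [F for F in frontier if q not in F]
    if not new_frontier:
        new_frontier = [F for F in accepting_sets if q not in F]
    return new_frontier

def frontier_reward(q, frontier, r_accept=1.0):
    # Positive reward only when q belongs to an accepting set still recorded
    # in the frontier; otherwise no reward.
    return r_accept if any(q in F for F in frontier) else 0.0

In a product-MDP learner, update_frontier would be applied after each automaton transition and frontier_reward fed to any standard model-free RL method (e.g., Q-learning) together with a suitable state-dependent discount, consistent with the reward and discount design the abstract alludes to.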