Paper Title

Model-Free Reinforcement Learning for Symbolic Automata-encoded Objectives

Authors

Anand Balakrishnan, Stefan Jakšić, Edgar A. Aguilar, Dejan Ničković, Jyotirmoy V. Deshmukh

Abstract

Reinforcement learning (RL) is a popular approach for robotic path planning in uncertain environments. However, the control policies trained for an RL agent crucially depend on user-defined, state-based reward functions. Poorly designed rewards can lead to policies that do get maximal rewards but fail to satisfy desired task objectives or are unsafe. There are several examples of the use of formal languages such as temporal logics and automata to specify high-level task specifications for robots (in lieu of Markovian rewards). Recent efforts have focused on inferring state-based rewards from formal specifications; here, the goal is to provide (probabilistic) guarantees that the policy learned using RL (with the inferred rewards) satisfies the high-level formal specification. A key drawback of several of these techniques is that the rewards that they infer are sparse: the agent receives positive rewards only upon completion of the task and no rewards otherwise. This naturally leads to poor convergence properties and high variance during RL. In this work, we propose using formal specifications in the form of symbolic automata: these serve as a generalization of both bounded-time temporal logic-based specifications as well as automata. Furthermore, our use of symbolic automata allows us to define non-sparse potential-based rewards which empirically shape the reward surface, leading to better convergence during RL. We also show that our potential-based rewarding strategy still allows us to obtain the policy that maximizes the satisfaction of the given specification.
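The core technical idea summarized above is potential-based reward shaping driven by an automaton that tracks task progress. The following is a minimal, illustrative Python sketch of that general recipe, not the paper's construction: the three-state automaton, the DIST_TO_ACCEPT table, and the guard predicates in automaton_step are invented for illustration, while shaped_reward applies the standard shaping rule r' = r + gamma*Phi(q') - Phi(q), which preserves the optimal policy while densifying the reward signal.

```python
# Minimal sketch of potential-based reward shaping over automaton states.
# All automaton details below (states, guards, distance table) are
# illustrative assumptions, not taken from the paper.

import numpy as np

GAMMA = 0.99

# Hypothetical 3-state task automaton: q0 -> q1 -> q_acc.
# dist_to_accept[q] = number of automaton transitions still needed;
# the potential grows as the agent gets closer to acceptance.
DIST_TO_ACCEPT = {"q0": 2.0, "q1": 1.0, "q_acc": 0.0}


def potential(q: str) -> float:
    """Potential Phi(q): negated distance of automaton state q to acceptance."""
    return -DIST_TO_ACCEPT[q]


def automaton_step(q: str, obs: np.ndarray) -> str:
    """Illustrative symbolic-automaton transition: guards are simple
    predicates over the continuous observation (e.g. region membership)."""
    if q == "q0" and obs[0] > 5.0:  # reached waypoint A
        return "q1"
    if q == "q1" and obs[1] > 5.0:  # reached goal region
        return "q_acc"
    return q


def shaped_reward(base_r: float, q: str, q_next: str) -> float:
    """Potential-based shaping: r' = r + gamma * Phi(q') - Phi(q)."""
    return base_r + GAMMA * potential(q_next) - potential(q)


if __name__ == "__main__":
    # Tiny rollout: the agent crosses waypoint A, then the goal region.
    q = "q0"
    for obs in [np.array([1.0, 0.0]), np.array([6.0, 0.0]), np.array([6.0, 7.0])]:
        q_next = automaton_step(q, obs)
        base_r = 1.0 if q_next == "q_acc" else 0.0  # sparse task reward
        print(q, "->", q_next, "shaped reward:", shaped_reward(base_r, q, q_next))
        q = q_next
```

In this sketch the sparse task reward fires only on acceptance, while the shaping term pays the agent for every automaton transition toward acceptance, which is the sense in which the abstract's potential-based rewards are non-sparse.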
