Paper Title
Reinforcement Learning for Task Specifications with Action-Constraints
Paper Authors
Paper Abstract
In this paper, we use concepts from the supervisory control theory of discrete event systems to propose a method for learning optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively, safe). We assume that the set of action sequences deemed unsafe and/or safe is given in terms of a finite-state automaton, and we propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied. We then present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we build on the framework of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications in reinforcement learning, and we show the results of simulations in this setting.
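To make the abstract's idea concrete, the following is a minimal sketch (not the authors' implementation) of Q-learning in which a supervisor, derived from a finite-state automaton over action sequences, disables actions that would drive the automaton into a violating state. The corridor MDP, the "never take 'left' twice in a row" constraint, and all identifiers below are illustrative assumptions; the reward-machine handling of state constraints described in the paper is omitted for brevity.

```python
# Sketch of supervisor-masked Q-learning on the product of an MDP state and an
# automaton state that tracks the action-sequence history. Illustrative only.
import random
from collections import defaultdict

# Hypothetical 1-D corridor MDP: states 0..4, reward 1 on reaching the goal.
N_STATES, GOAL = 5, 4
ACTIONS = ["left", "right"]

def step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

# Automaton over ACTIONS encoding "never take 'left' twice in a row".
# State 0: start / after 'right'; state 1: after one 'left'; None: violation.
DFA_INIT = 0
def dfa_step(q, a):
    if a == "right":
        return 0
    return 1 if q == 0 else None  # a second consecutive 'left' is unsafe

def enabled_actions(q):
    """Supervisor: allow only actions that keep the automaton out of the dead state."""
    return [a for a in ACTIONS if dfa_step(q, a) is not None]

# Q-learning over the product state (MDP state, automaton state).
ALPHA, GAMMA, EPS, EPISODES = 0.1, 0.95, 0.1, 2000
Q = defaultdict(float)

for _ in range(EPISODES):
    s, q = 0, DFA_INIT
    done = False
    while not done:
        allowed = enabled_actions(q)              # supervisor masks unsafe actions
        if random.random() < EPS:
            a = random.choice(allowed)
        else:
            a = max(allowed, key=lambda act: Q[(s, q, act)])
        s2, r, done = step(s, a)
        q2 = dfa_step(q, a)                        # advance the action-history automaton
        best_next = 0.0 if done else max(Q[(s2, q2, b)] for b in enabled_actions(q2))
        Q[(s, q, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, q, a)])
        s, q = s2, q2

print("Greedy action at (state=0, dfa=0):",
      max(ACTIONS, key=lambda act: Q[(0, 0, act)]))
```

Because the supervisor restricts the action set as a function of the automaton state, the learned greedy policy over the product state space satisfies the action-sequence constraint by construction while Q-learning optimizes return within the allowed behaviors.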