Paper Title
Approximate information state for approximate planning and reinforcement learning in partially observed systems
Paper Authors
Paper Abstract
We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two equivalent definitions of information state -- i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called an approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this approximate dynamic program is approximately optimal, with a bounded loss of optimality. We show that several approximations in state, observation and action spaces in the literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS-based multi-time-scale policy gradient algorithms and detailed numerical experiments with low-, moderate- and high-dimensional environments.
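The two defining properties in the abstract can be stated compactly as follows. The notation is ours, introduced only for illustration: $H_t$ denotes the history of observations and actions, $Z_t = \sigma_t(H_t)$ the information state, $R_t$ the reward, and $A_t$ the action. Per definition i), for every realization $h_t$ of the history and every action $a_t$,

$$\mathbb{E}[R_t \mid H_t = h_t, A_t = a_t] = \mathbb{E}[R_t \mid Z_t = \sigma_t(h_t), A_t = a_t],$$
$$\mathbb{P}(Z_{t+1} \mid H_t = h_t, A_t = a_t) = \mathbb{P}(Z_{t+1} \mid Z_t = \sigma_t(h_t), A_t = a_t).$$

An approximate information state $\hat{Z}_t$ satisfies these only up to tolerances, say $\varepsilon$ for the reward prediction and $\delta$ for the next-state prediction (measured in a suitable metric on probability distributions); the bounded loss of optimality mentioned in the abstract is then a function of $\varepsilon$ and $\delta$.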
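As a rough illustration of how an AIS and a policy might be learnt together from data at different time scales, here is a minimal PyTorch sketch. It is not the authors' implementation: every name in it (AISGenerator, reward_head, obs_head, update) is our own, the mean-squared next-observation loss is a simple stand-in for a distributional prediction loss, and the dimensions are arbitrary.

import torch
import torch.nn as nn

obs_dim, act_dim, ais_dim = 4, 2, 32  # arbitrary illustrative sizes

class AISGenerator(nn.Module):
    """Recursively maps (previous AIS, previous action, observation) to the
    next AIS, and predicts the expected reward and the next observation."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + act_dim, ais_dim)
        self.reward_head = nn.Linear(ais_dim + act_dim, 1)
        self.obs_head = nn.Linear(ais_dim + act_dim, obs_dim)

    def step(self, ais, obs, prev_act):
        return self.rnn(torch.cat([obs, prev_act], dim=-1), ais)

gen = AISGenerator()
policy = nn.Sequential(nn.Linear(ais_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

opt_ais = torch.optim.Adam(gen.parameters(), lr=1e-3)  # faster time scale
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)  # slower time scale

def update(trajectory, gamma=0.99):
    """trajectory: list of (obs, one-hot act, reward, next_obs) with shapes
    (1, obs_dim), (1, act_dim), scalar, (1, obs_dim)."""
    ais = torch.zeros(1, ais_dim)
    prev_act = torch.zeros(1, act_dim)
    ais_loss = torch.zeros(())
    log_probs, rewards = [], []
    for obs, act, r, next_obs in trajectory:
        ais = gen.step(ais, obs, prev_act)
        za = torch.cat([ais, act], dim=-1)
        # AIS losses: reward-prediction error plus next-observation
        # prediction error (an MSE surrogate for a distributional loss).
        ais_loss = ais_loss + (gen.reward_head(za).squeeze() - r) ** 2 \
                            + ((gen.obs_head(za) - next_obs) ** 2).sum()
        # The policy acts on the (detached) AIS.
        dist = torch.distributions.Categorical(logits=policy(ais.detach()))
        log_probs.append(dist.log_prob(act.argmax()).squeeze())
        rewards.append(float(r))
        prev_act = act
    # REINFORCE with discounted returns-to-go.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    pi_loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    opt_ais.zero_grad(); opt_pi.zero_grad()
    (ais_loss + pi_loss).backward()
    opt_ais.step(); opt_pi.step()

# Example call with a random two-step trajectory:
traj = [(torch.randn(1, obs_dim), torch.eye(act_dim)[0:1], 1.0, torch.randn(1, obs_dim)),
        (torch.randn(1, obs_dim), torch.eye(act_dim)[1:2], 0.0, torch.randn(1, obs_dim))]
update(traj)

The two learning rates realize the multi-time-scale idea (the AIS generator adapts faster than the policy), and detaching the AIS before the policy head keeps the policy gradient from altering the representation; both are illustrative design choices, not prescriptions taken from the paper.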