具有B稳定性的部分可观察的RL：统一的结构条件和鲜明的样品效率算法

论文标题

具有B稳定性的部分可观察的RL：统一的结构条件和鲜明的样品效率算法

Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms

论文作者

Chen, Fan, Bai, Yu, Mei, Song

论文摘要

部分可观察性 - 代理只能观察到有关系统真正潜在状态的部分信息 - 在增强学习的现实应用中无处不在（RL）。从理论上讲，在最坏情况下，由于指数样本的复杂性下限，在最坏的情况下，在最坏情况下学习了近乎最佳的政策。最近的工作已经确定了几个可通过多项式样本学习的可学性亚类，例如部分可观察到的马尔可夫决策过程（POMDPS）具有某些可揭示或可分解性条件。但是，这一研究仍处于起步阶段，（1）缺乏统一的结构条件，使样品有效学习；（2）现有的已知可导致子类的样品复杂性远非锋利；（3）与完全可观察到的RL相比，可用的样品有效算法更少。在预测状态表示（PSR）的一般环境中，本文在上面的所有三个方面都促进了部分可观察到的RL。首先，我们提出了一种称为\ emph {b稳定性}的自然和统一的结构条件。 B稳定的PSR包括绝大多数已知的可拖运子类，例如弱揭示的POMDP，低级别的未来富裕POMDP，可解码的POMDP和常规PSR。接下来，我们表明，在相关问题参数中，可以使用多项式样本学习任何B稳定PSR。当在上述子类中实例化时，我们的样本复杂性比当前最好的复杂性大大改善。最后，我们的结果是通过三种算法同时实现的：乐观的最大似然估计，估计到决策和基于模型的乐观后验采样。后两种算法是用于POMDPS/PSR的样品研究的新算法。

Partial Observability -- where agents can only observe partial information about the true underlying state of the system -- is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called \emph{B-stability}. B-stable PSRs encompasses the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.

下载PDF全文

下载文献需遵守相关版权规定

论文标题