Paper Title

Robust Anytime Learning of Markov Decision Processes

Authors

Marnix Suilen, Thiago D. Simão, David Parker, Nils Jansen

Abstract

Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. Furthermore, our method is capable of adapting to changes in the environment. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
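The two core ingredients of the abstract can be illustrated with a short sketch: (1) turning limited observation counts into probability *intervals*, and (2) the inner step of a robust Bellman backup on the resulting interval MDP, where an adversarial "nature" picks the worst distribution inside the intervals. This is only an illustrative assumption, not the paper's method: it uses a Hoeffding-style confidence width instead of the authors' dedicated Bayesian inference scheme, and all function names and the two-state example are invented for the sketch.

```python
from math import log, sqrt

def count_interval(successes, trials, delta=0.05):
    """Confidence interval for one transition probability estimated from
    `successes` observations out of `trials`. Illustrative only: a
    Hoeffding-style width, not the paper's Bayesian inference scheme."""
    p_hat = successes / trials
    width = sqrt(log(2.0 / delta) / (2.0 * trials))
    return max(0.0, p_hat - width), min(1.0, p_hat + width)

def worst_case_value(intervals, values):
    """Inner problem of a robust Bellman backup on an interval MDP:
    among all distributions whose entries lie in the given intervals
    (and sum to 1), return the minimal expected successor value.
    Assumes the intervals admit at least one valid distribution."""
    probs = [lo for lo, _ in intervals]       # start from all lower bounds
    budget = 1.0 - sum(probs)                 # mass still to distribute
    # Greedily push the remaining mass onto the lowest-valued successors.
    for s in sorted(range(len(values)), key=lambda s: values[s]):
        extra = min(intervals[s][1] - probs[s], budget)
        probs[s] += extra
        budget -= extra
    return sum(p * v for p, v in zip(probs, values))
```

For example, `count_interval(8, 10)` returns an interval around the point estimate 0.8 that widens as `trials` shrinks, and `worst_case_value([(0.1, 0.9), (0.1, 0.9)], [0.0, 10.0])` puts as much mass as the intervals allow on the zero-valued successor. A robust policy is then one that maximizes value under these worst-case backups.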
