Paper Title

Conservative Exploration in Reinforcement Learning

Paper Authors

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

Paper Abstract

While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesired in real-world applications where a minimum requirement is that the executed policies are guaranteed to perform at least as well as an existing baseline. In this paper, we introduce the notion of conservative exploration for average reward and finite horizon problems. We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning. We derive regret bounds showing that being conservative does not hinder the learning ability of these algorithms.
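The abstract's central idea is a conservative constraint: the policies executed during learning must, in aggregate, perform at least as well as a known baseline (up to a tolerated fraction). The sketch below is not the authors' algorithm; it is a minimal illustration, under assumed names (`alpha`, `baseline_value`, confidence-bound lists), of the kind of per-episode check that conservative exploration methods typically use to decide between an optimistic candidate policy and the baseline.

```python
# Minimal sketch of a conservative-constraint check (illustrative, not the paper's algorithm).
# Idea: play the optimistic policy only if a pessimistic estimate of the total value
# collected so far, plus the candidate, still exceeds (1 - alpha) times the cumulative
# value the baseline would have collected over the same number of episodes.

def can_play_optimistic(
    past_lower_bounds,      # lower confidence bounds on values of policies already played
    candidate_lower_bound,  # lower confidence bound on the optimistic candidate policy
    baseline_value,         # per-episode value of the baseline policy (assumed known)
    alpha,                  # fraction of baseline performance we are allowed to lose
):
    """Return True if playing the optimistic candidate keeps the conservative constraint."""
    episodes = len(past_lower_bounds) + 1
    conservative_budget = (1.0 - alpha) * episodes * baseline_value
    pessimistic_total = sum(past_lower_bounds) + candidate_lower_bound
    return pessimistic_total >= conservative_budget


# Usage: fall back to the baseline whenever the check fails; the baseline's value
# then counts fully toward satisfying the constraint in later episodes.
past = [4.8, 5.1]  # hypothetical lower bounds from earlier episodes
policy = "optimistic" if can_play_optimistic(past, 4.5, baseline_value=5.0, alpha=0.1) else "baseline"
print(policy)
```

The optimism in the paper's algorithms enters through how the candidate policy is chosen; the check above only illustrates why being conservative need not block exploration, since the budget grows with every episode in which performance exceeds the baseline threshold.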
