Paper Title

Reward Shaping for Reinforcement Learning with Omega-Regular Objectives

Authors

Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., Wojtczak, D.

Abstract

Recently, successful approaches have been made to exploit good-for-MDPs automata (Büchi automata with a restricted form of nondeterminism) for model free reinforcement learning, a class of automata that subsumes good for games automata and the most widespread class of limit deterministic automata. The foundation of using these Büchi automata is that the Büchi condition can, for good-for-MDP automata, be translated to reachability. The drawback of this translation is that the rewards are, on average, reaped very late, which requires long episodes during the learning process. We devise a new reward shaping approach that overcomes this issue. We show that the resulting model is equivalent to a discounted payoff objective with a biased discount that simplifies and improves on prior work in this direction.
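The "biased discount" idea in the abstract can be illustrated with a toy example: accepting transitions of the product MDP pay a positive reward and are discounted, while non-accepting transitions pay nothing and pass value through undiscounted, so the value of a state approaches the probability of satisfying the Büchi objective. The following is a minimal, hedged sketch of that shaping on an invented two-state product MDP (the state names, reward scale `1 - GAMMA`, and sweep-based updates are illustrative assumptions, not the paper's exact construction):

```python
# Illustrative sketch (not the authors' exact construction):
# tabular Q-style sweeps on a toy product MDP where an accepting
# (Büchi) transition pays reward (1 - GAMMA) and is discounted by
# GAMMA, while a non-accepting transition pays 0 and is discounted
# by 1 -- the "biased discount" mentioned in the abstract.

GAMMA = 0.9     # discount applied only on accepting transitions
ALPHA = 0.5     # learning rate
SWEEPS = 500    # full sweeps over all (state, action) pairs

# transitions[s][a] = (next_state, is_accepting)
transitions = {
    0: {"stay_accepting": (0, True),   # accepting self-loop
        "leave": (1, False)},          # escape to a rejecting trap
    1: {"loop": (1, False)},           # rejecting sink
}

Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}

for _ in range(SWEEPS):
    for s, acts in transitions.items():
        for a, (s2, accepting) in acts.items():
            reward = (1.0 - GAMMA) if accepting else 0.0
            discount = GAMMA if accepting else 1.0
            target = reward + discount * max(Q[s2].values())
            Q[s][a] += ALPHA * (target - Q[s][a])

# The accepting loop dominates: its value converges to 1, matching
# the probability of satisfying the Büchi condition from state 0,
# while the rejecting trap keeps value 0.
print(Q[0])
```

Because reward is paid on every accepting transition rather than only at a far-away reachability sink, value propagates back to early states much sooner, which is the point of the shaping: it shortens the episodes needed for learning.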
