Title
Calculus on MDPs: Potential Shaping as a Gradient
Authors
Abstract
In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce. A particularly well-known and important example is potential shaping, a class of functions that can be added to any reward function without changing the optimal policy set under arbitrary transition dynamics. Potential shaping is conceptually similar to potentials, conservative vector fields and gauge transformations in math and physics, but this connection has not previously been formally explored. We develop a formalism for discrete calculus on graphs that abstract a Markov Decision Process, and show how potential shaping can be formally interpreted as a gradient within this framework. This allows us to strengthen results from Ng et al. (1999) describing conditions under which potential shaping is the only additive reward transformation to always preserve optimal policies. As an additional application of our formalism, we define a rule for picking a single unique reward function from each potential shaping equivalence class.
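The invariance described in the abstract can be checked concretely: adding a shaping term F(s, a, s') = γΦ(s') − Φ(s) to the reward, for any state potential Φ, leaves the greedy optimal policy unchanged. Below is a minimal sketch on a hypothetical two-state deterministic MDP (the states, rewards, and potential values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP:
# transitions[s, a] = next state, rewards[s, a] = reward.
gamma = 0.9
transitions = np.array([[0, 1],
                        [0, 1]])
rewards = np.array([[0.0, 1.0],
                    [0.5, 0.0]])

def optimal_policy(R, T, gamma, iters=1000):
    """Greedy policy obtained by value iteration on a deterministic MDP."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        V = (R + gamma * V[T]).max(axis=1)
    return (R + gamma * V[T]).argmax(axis=1)

# Potential shaping: add F(s, a, s') = gamma * Phi(s') - Phi(s)
# to the reward, for an arbitrary potential Phi over states.
Phi = np.array([3.0, -2.0])
shaped = rewards + gamma * Phi[transitions] - Phi[:, None]

print(optimal_policy(rewards, transitions, gamma))  # original policy
print(optimal_policy(shaped, transitions, gamma))   # identical policy
```

The two printed policies coincide, illustrating the classical result of Ng et al. (1999) that the paper reinterprets as a discrete gradient on the MDP's graph.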