Paper Title

Reinforcement Learning with Non-Exponential Discounting

Authors

Matthias Schultheis, Constantin A. Rothkopf, Heinz Koeppl

Abstract

Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
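
For intuition, here is a minimal sketch of the kind of equation involved, under simplifying assumptions of deterministic dynamics \(\dot{x} = f(x, u)\), an instantaneous reward \(r(x, u)\), and an arbitrary discount function \(\gamma(t)\). The notation \(V\), \(f\), \(r\), \(\gamma\), and \(V_\theta\) below is illustrative and not taken from the paper, whose full treatment also covers stochastic termination times. With a time-dependent value function \(V(x, t) = \sup_u \int_t^\infty \gamma(\tau)\, r(x(\tau), u(\tau))\, d\tau\), the associated HJB equation reads

\[
-\partial_t V(x,t) \;=\; \max_{u}\Big\{ \gamma(t)\, r(x,u) \;+\; \nabla_x V(x,t)^{\top} f(x,u) \Big\}.
\]

For exponential discounting \(\gamma(t) = e^{-\rho t}\), substituting \(V(x,t) = e^{-\rho t} W(x)\) recovers the familiar stationary HJB \(\rho W(x) = \max_u \{ r(x,u) + \nabla_x W(x)^{\top} f(x,u) \}\); for general \(\gamma\) the explicit time dependence remains. A collocation approach in this spirit would parameterize \(V_\theta(x,t)\) with a neural network and minimize the squared HJB residual at sampled collocation points \((x_i, t_i)\),

\[
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}\Big( \partial_t V_\theta(x_i,t_i) \;+\; \max_{u}\big\{ \gamma(t_i)\, r(x_i,u) + \nabla_x V_\theta(x_i,t_i)^{\top} f(x_i,u) \big\} \Big)^{2},
\]

again as a sketch of the general idea rather than the paper's exact objective.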
