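Bounded Risk-Sensitive Markov Games: Forward Policy Design and Inverse Reward Learning with Iterative Reasoning and Cumulative Prospect Theory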

Paper Title

Bounded Risk-Sensitive Markov Games: Forward Policy Design and Inverse Reward Learning with Iterative Reasoning and Cumulative Prospect Theory

Authors

Ran Tian, Liting Sun, Masayoshi Tomizuka

Abstract

Classical game-theoretic approaches to multi-agent systems, in both the forward policy design problem and the inverse reward learning problem, often make strong rationality assumptions: agents perfectly maximize expected utilities under uncertainty. Such assumptions, however, are substantially at odds with observed human behaviors such as satisficing with sub-optimal, risk-seeking, and loss-averse decisions. In this paper, we investigate the bounded risk-sensitive Markov game (BRSMG) and its inverse reward learning problem for modeling realistic human behaviors and learning human behavioral models. Drawing on iterative reasoning models and cumulative prospect theory, we assume that humans have bounded intelligence and maximize risk-sensitive utilities in BRSMGs. Convergence analyses for both the forward policy design and the inverse reward learning problems are established under the BRSMG framework. We validate the proposed forward policy design and inverse reward learning algorithms in a navigation scenario. The results show that the agents' behaviors exhibit both risk-averse and risk-seeking characteristics. Moreover, in the inverse reward learning task, the proposed bounded risk-sensitive inverse learning algorithm outperforms a baseline risk-neutral inverse learning algorithm: given demonstrations of agents' interactive behaviors, it recovers not only more accurate reward values but also the intelligence levels and the risk-measure parameters.
