Paper Title
Performative Reinforcement Learning
Paper Authors
Paper Abstract
We introduce the framework of performative reinforcement learning, where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction~\cite{Perdomo et al., 2020}, we introduce the concept of a performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results to the setting where the learner only performs gradient ascent steps instead of fully optimizing the objective, and to the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters such as regularization, smoothness, and the number of samples.
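To make the abstract's central notions concrete, the following is a minimal sketch in illustrative notation not fixed by the abstract: $r(\pi)$ and $P(\pi)$ denote the reward and transition functions induced when policy $\pi$ is deployed, $V_{r,P}(\pi')$ denotes the value of a policy $\pi'$ in the MDP with reward $r$ and transitions $P$, $\Omega$ is a regularizer, and $\lambda$ its strength; these symbols are assumptions for illustration, not the paper's own notation.

% Performative stability (sketch): a policy that is optimal
% for the very environment its own deployment induces.
\[
\pi_{\mathrm{PS}} \in \arg\max_{\pi} \; V_{r(\pi_{\mathrm{PS}}),\, P(\pi_{\mathrm{PS}})}(\pi)
\]

% Repeated optimization of a regularized objective (sketch):
% at each round, solve the regularized RL problem in the
% environment induced by the previously deployed policy.
\[
\pi_{t+1} \in \arg\max_{\pi} \; \Big[ V_{r(\pi_t),\, P(\pi_t)}(\pi) \;-\; \lambda\, \Omega(\pi) \Big]
\]

Under this reading, the abstract's convergence claim is that the sequence $\pi_1, \pi_2, \dots$ produced by the second update approaches a fixed point satisfying the first condition, given suitable regularization and smoothness of the map $\pi \mapsto (r(\pi), P(\pi))$.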