Paper Title

On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces

Authors

Amrit Singh Bedi, Souradip Chakraborty, Anjaly Parayil, Brian Sadler, Pratap Tokekar, Alec Koppel

Abstract

We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work, we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes, whereas prior works require these parameters to respectively shrink to null or grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks.
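The abstract names three algorithmic ingredients: a heavy-tailed policy parameterization with a bounded score function, mirror ascent-type updates, and gradient tracking, all run with constant step and batch sizes. The sketch below is a minimal illustration of how these pieces could fit together on a toy one-step problem; it is not the authors' implementation. The Cauchy policy, the linear feature map, the toy reward, the simplified tracking correction (fresh action draws rather than importance-weighted reuse of trajectories), and the squared-Euclidean mirror map are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 3        # dimensionality of the toy state features (assumption)
CAUCHY_SCALE = 0.5   # fixed scale of the Cauchy policy (assumption)
STEP_SIZE = 0.05     # constant step size, matching the constant-step regime in the abstract
BATCH_SIZE = 16      # constant batch size


def policy_mean(theta, state):
    """Location parameter of the Cauchy policy, linear in the state features."""
    return float(theta @ state)


def sample_action(theta, state):
    """Sample an action from Cauchy(policy_mean(theta, state), CAUCHY_SCALE)."""
    u = rng.random()
    return policy_mean(theta, state) + CAUCHY_SCALE * np.tan(np.pi * (u - 0.5))


def score(theta, state, action):
    """Score function (gradient of the log-density) of the Cauchy policy.

    d/dmu log p(a) = 2 (a - mu) / (CAUCHY_SCALE**2 + (a - mu)**2), which stays
    bounded as |a - mu| grows -- the heavy-tailed property the abstract points
    to, in contrast to the Gaussian score, which is linear in (a - mu).
    """
    diff = action - policy_mean(theta, state)
    return (2.0 * diff / (CAUCHY_SCALE ** 2 + diff ** 2)) * state


def reward(state, action):
    """Toy one-step reward: largest when the action hits a hidden linear target."""
    target = float(np.array([1.0, -2.0, 0.5]) @ state)
    return -(action - target) ** 2


def pg_estimate(theta, states):
    """REINFORCE-style policy-gradient estimate over a batch of states."""
    grad = np.zeros(STATE_DIM)
    for state in states:
        action = sample_action(theta, state)
        grad += reward(state, action) * score(theta, state, action)
    return grad / len(states)


theta = np.zeros(STATE_DIM)       # current policy parameters
theta_prev = theta.copy()
direction = np.zeros(STATE_DIM)   # gradient-tracking direction d_t

for t in range(500):
    states = [rng.standard_normal(STATE_DIM) for _ in range(BATCH_SIZE)]

    # Gradient tracking (simplified): correct the running direction with the
    # difference between estimates at the current and previous iterates, both
    # computed on the same batch of states.  A faithful estimator would reuse
    # the same actions with importance weights; fresh draws are used here to
    # keep the sketch short.
    g_new = pg_estimate(theta, states)
    g_old = pg_estimate(theta_prev, states)
    direction = g_new if t == 0 else g_new + direction - g_old

    # Mirror ascent step.  With the squared-Euclidean mirror map, the proximal
    # update argmax_w <direction, w> - ||w - theta||^2 / (2 * STEP_SIZE)
    # reduces to a plain gradient step; other Bregman divergences change this line.
    theta_prev = theta.copy()
    theta = theta + STEP_SIZE * direction
```

Note that the Cauchy score above has norm at most ||state|| / CAUCHY_SCALE regardless of how far the sampled action falls from the policy mean, whereas the Gaussian score grows linearly with that distance, which is the boundedness issue the abstract describes.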
