Paper title
Adaptive Experience Selection for Policy Gradient
Paper authors
Paper abstract
Policy gradient reinforcement learning (RL) algorithms have achieved impressive performance in challenging learning tasks such as continuous control, but suffer from high sample complexity. Experience replay is a commonly used approach to improve sample efficiency, but gradient estimators using past trajectories typically have high variance. Existing sampling strategies for experience replay, such as uniform sampling or prioritised experience replay, do not explicitly try to control the variance of the gradient estimates. In this paper, we propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance. Using a regret minimisation approach, AES iteratively updates the experience sampling distribution to match the performance of a competitor distribution assumed to have optimal variance. Sample non-stationarity is addressed by proposing a dynamic (i.e. time-changing) competitor distribution, for which a closed-form solution is derived. We demonstrate that AES is a low-regret algorithm with reasonable sample complexity. Empirically, AES has been implemented for the deep deterministic policy gradient and soft actor-critic algorithms, and tested on 8 continuous control tasks from the OpenAI Gym library. Our results show that AES leads to significantly improved performance compared to currently available experience sampling strategies for policy gradient.
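The abstract does not spell out the update rule, so the following is only a minimal conceptual sketch, assuming an exponentiated-gradient (multiplicative-weights) style regret-minimisation update over per-experience sampling weights; the class name, the `variance_scores` proxy, and the hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np


class AdaptiveSamplingDistribution:
    """Conceptual sketch (not the paper's AES algorithm): maintain a sampling
    distribution over replay-buffer experiences and update it with a
    multiplicative-weights / exponentiated-gradient rule, a standard
    regret-minimisation scheme. The per-experience `variance_scores` used
    below are a hypothetical proxy for each sample's contribution to the
    variance of the policy-gradient estimator."""

    def __init__(self, buffer_size: int, learning_rate: float = 0.1):
        self.weights = np.ones(buffer_size)  # unnormalised sampling weights
        self.learning_rate = learning_rate

    def probabilities(self) -> np.ndarray:
        # Normalise the weights into a sampling distribution.
        return self.weights / self.weights.sum()

    def sample(self, batch_size: int, rng: np.random.Generator) -> np.ndarray:
        # Draw experience indices according to the current distribution.
        return rng.choice(len(self.weights), size=batch_size,
                          p=self.probabilities())

    def update(self, indices: np.ndarray, variance_scores: np.ndarray) -> None:
        # Exponentiated-gradient step: down-weight experiences whose
        # (estimated) contribution to the gradient variance is large.
        # np.multiply.at handles repeated indices from sampling with replacement.
        np.multiply.at(self.weights, indices,
                       np.exp(-self.learning_rate * variance_scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dist = AdaptiveSamplingDistribution(buffer_size=1000)
    for _ in range(10):
        idx = dist.sample(batch_size=64, rng=rng)
        # Placeholder variance proxy; a real agent would compute this from
        # per-sample gradients of its actor/critic losses.
        scores = rng.random(len(idx))
        dist.update(idx, scores)
    print(dist.probabilities()[:5])
```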