Paper Title

Gradient Monitored Reinforcement Learning

Authors

Mohammed Sharafath Abdul Hameed, Gavneet Singh Chadha, Andreas Schwung, Steven X. Ding

Abstract

This paper presents a novel neural network training approach for faster convergence and better generalization abilities in deep reinforcement learning. In particular, we focus on enhancing training and evaluation performance in reinforcement learning algorithms by systematically reducing the variance of the gradients and thereby providing a more targeted learning process. The proposed method, which we term Gradient Monitoring (GM), is an approach to steer the learning in the weight parameters of a neural network based on the dynamic development of, and feedback from, the training process itself. We propose different variants of the GM methodology which are shown to improve the underlying performance of the model. One of the proposed variants, Momentum with Gradient Monitoring (M-WGM), allows for a continuous adjustment of the quantum of back-propagated gradients in the network based on certain learning parameters. We further enhance the method with the Adaptive Momentum with Gradient Monitoring (AM-WGM) variant, which allows for automatic adjustment between focused learning of certain weights and more dispersed learning, depending on feedback from the rewards collected. As a by-product, it also allows for automatic derivation of the required deep network sizes during training, since the algorithm automatically freezes trained weights. The approach is applied to two discrete tasks (a Multi-Robot Co-ordination problem and Atari games) and one continuous control task (MuJoCo) using Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO), respectively. The results obtained particularly underline the applicability of the methods and their performance improvements in terms of generalization capability.
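Since no code accompanies this abstract, the following is a minimal sketch of how gradient monitoring via gradient masking could look in a PyTorch training loop. The GradientMonitor class, the beta and keep_ratio parameters, and the top-k magnitude criterion are illustrative assumptions, not the paper's exact M-WGM/AM-WGM update rules; the sketch only conveys the general idea of steering learning toward a subset of weight parameters based on feedback accumulated during training.

import torch
import torch.nn as nn

class GradientMonitor:
    """Keeps a momentum estimate of per-weight gradient magnitudes and masks the
    gradients of weights whose estimate falls outside the top-k, effectively
    freezing them for the current update (illustrative, not the paper's rule)."""

    def __init__(self, model: nn.Module, beta: float = 0.9, keep_ratio: float = 0.5):
        self.model = model
        self.beta = beta              # momentum for the gradient-magnitude estimate
        self.keep_ratio = keep_ratio  # fraction of weights allowed to keep learning
        self.momenta = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

    @torch.no_grad()
    def apply_masks(self) -> None:
        for name, param in self.model.named_parameters():
            if param.grad is None:
                continue
            # Running (momentum-smoothed) estimate of each weight's gradient magnitude.
            m = self.momenta[name]
            m.mul_(self.beta).add_(param.grad.abs(), alpha=1.0 - self.beta)
            # Keep only the gradients of the top-k weights by estimated magnitude,
            # zeroing out the rest before the optimizer step.
            k = max(1, int(self.keep_ratio * m.numel()))
            threshold = m.flatten().kthvalue(m.numel() - k + 1).values
            param.grad.mul_((m >= threshold).to(param.grad.dtype))

# Usage in a typical policy-gradient update (sketch):
#   loss.backward()
#   monitor.apply_masks()   # mask gradients of "frozen" weights
#   optimizer.step()
#   optimizer.zero_grad()

In the spirit of AM-WGM, keep_ratio could itself be adapted from the reward feedback collected during training rather than kept fixed; that adaptation logic is omitted here.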
