Paper Title
Explicit Regularization via Regularizer Mirror Descent
Paper Authors
Paper Abstract
Despite perfectly interpolating the training data, deep neural networks (DNNs) can often generalize fairly well, in part due to the "implicit regularization" induced by the learning algorithm. Nonetheless, various forms of regularization, such as "explicit regularization" (via weight decay), are often used to avoid overfitting, especially when the data is corrupted. There are several challenges with explicit regularization, most notably unclear convergence properties. Inspired by the convergence properties of stochastic mirror descent (SMD) algorithms, we propose a new method for training DNNs with regularization, called regularizer mirror descent (RMD). In highly overparameterized DNNs, SMD simultaneously interpolates the training data and minimizes a certain potential function of the weights. RMD starts with a standard cost, namely the sum of the training loss and a convex regularizer of the weights. Reinterpreting this cost as the potential of an "augmented" overparameterized network and applying SMD yields RMD. As a result, RMD inherits the properties of SMD and provably converges to a point "close" to the minimizer of this cost. RMD is computationally comparable to stochastic gradient descent (SGD) and weight decay, and is parallelizable in the same manner. Our experimental results on training sets with various levels of corruption suggest that the generalization performance of RMD is remarkably robust and significantly better than both SGD and weight decay, which regularize the $\ell_2$ norm of the weights implicitly and explicitly, respectively. RMD can also be used to regularize the weights to a desired weight vector, which is particularly relevant for continual learning.
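For intuition about the mirror-descent mechanics the abstract builds on, the sketch below (in NumPy, not from the paper) runs generic stochastic mirror descent with a $q$-norm potential on a toy overparameterized least-squares problem: the iterates interpolate the training data, and the choice of potential steers which interpolating solution is reached. This is only an illustrative sketch of SMD, not the authors' RMD algorithm; the potential, step size, and synthetic data are assumptions made for the example.

```python
# Illustrative sketch of stochastic mirror descent (SMD) with a q-norm potential
# psi(w) = (1/q) * ||w||_q^q on a toy overparameterized least-squares problem.
# This is NOT the paper's RMD algorithm; it only shows the generic SMD update
# that RMD builds on.  All hyperparameters and data below are illustrative.
import numpy as np

def grad_potential(w, q):
    """Mirror map: gradient of psi(w) = (1/q) * ||w||_q^q."""
    return np.sign(w) * np.abs(w) ** (q - 1)

def grad_potential_inv(u, q):
    """Inverse mirror map: recovers w from the dual variable u = grad psi(w)."""
    return np.sign(u) * np.abs(u) ** (1.0 / (q - 1))

def smd(X, y, q=3.0, lr=5e-3, epochs=2000, seed=0):
    """Run SMD on the squared loss of a linear model y ~ X w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    u = grad_potential(w, q)          # dual (mirror) variable
    for _ in range(epochs):
        for i in rng.permutation(n):  # one stochastic pass over the data
            resid = X[i] @ w - y[i]
            u -= lr * resid * X[i]    # mirror-descent step in the dual space
            w = grad_potential_inv(u, q)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 100))  # overparameterized: d >> n
    y = rng.standard_normal(20)
    for q in (2.0, 3.0):
        w = smd(X, y, q=q)
        print(f"q={q}: train residual {np.linalg.norm(X @ w - y):.2e}, "
              f"||w||_2={np.linalg.norm(w):.3f}, ||w||_1={np.linalg.norm(w, 1):.3f}")
```

With $q = 2$ the update reduces to plain SGD; other values of $q$ bias the iterates toward a different interpolating solution, which is the implicit-regularization effect of SMD that the abstract describes RMD repurposing to enforce an explicit regularizer.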