Paper Title
Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks
Paper Authors
Paper Abstract
The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
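To make the two initialization schemes concrete, here is a minimal NumPy sketch (not part of the paper; the helper names `gaussian_init` and `orthogonal_init` are illustrative). It draws the layer weights of a deep linear network either as iid Gaussians with the standard 1/width variance or from the orthogonal group via a QR decomposition, and compares the singular values of the end-to-end product matrix. The orthogonal product remains exactly isometric (all singular values equal to 1), whereas the singular values of the Gaussian product spread out rapidly with depth, which is the dynamical-isometry intuition behind the result.

```python
import numpy as np

def gaussian_init(n, depth, rng):
    """iid Gaussian weights with the standard 1/n variance scaling."""
    return [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

def orthogonal_init(n, depth, rng):
    """Weights drawn from the orthogonal group via QR of a Gaussian matrix."""
    mats = []
    for _ in range(depth):
        q, r = np.linalg.qr(rng.standard_normal((n, n)))
        # Sign-correct the columns so q is (approximately) Haar-distributed.
        q *= np.sign(np.diag(r))
        mats.append(q)
    return mats

def end_to_end_singular_values(weights):
    """Singular values of the product W_L ... W_1 (the end-to-end linear map)."""
    prod = np.eye(weights[0].shape[0])
    for w in weights:
        prod = w @ prod
    return np.linalg.svd(prod, compute_uv=False)

rng = np.random.default_rng(0)
n, depth = 64, 32
for name, init in [("gaussian", gaussian_init), ("orthogonal", orthogonal_init)]:
    s = end_to_end_singular_values(init(n, depth, rng))
    print(f"{name:>10}: max sv = {s.max():.3f}, min sv = {s.min():.3e}")
```

Running the sketch, the orthogonal product reports max and min singular values of 1, while the Gaussian product's spectrum is badly conditioned at this depth unless the width is increased, consistent with the depth-dependent width requirement stated in the abstract.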