Paper Title

Almost Sure Convergence of Dropout Algorithms for Neural Networks

Paper Authors

Albert Senen-Cerda, Jaron Sanders

Paper Abstract

We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton et al., 2012). With the goal of avoiding overfitting during training of NNs, dropout algorithms consist in practice of multiplying the weight matrices of an NN componentwise by independently drawn random matrices with $\{0,1\}$-valued entries during each iteration of Stochastic Gradient Descent (SGD). This paper presents a probability theoretical proof that for fully-connected NNs with differentiable, polynomially bounded activation functions, if we project the weights onto a compact set when using a dropout algorithm, then the weights of the NN converge to a unique stationary point of a projected system of Ordinary Differential Equations (ODEs). After this general convergence guarantee, we go on to investigate the convergence rate of dropout. Firstly, we obtain generic sample complexity bounds for finding $\epsilon$-stationary points of smooth nonconvex functions using SGD with dropout that explicitly depend on the dropout probability. Secondly, we obtain an upper bound on the rate of convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms for NNs with the shape of arborescences of arbitrary depth and with linear activation functions. The latter bound shows that for an algorithm such as Dropout or Dropconnect (Wan et al., 2013), the convergence rate can be impaired exponentially by the depth of the arborescence. In contrast, we experimentally observe no such dependence for wide NNs with just a few dropout layers. We also provide a heuristic argument for this observation. Our results suggest that there is a change of scale of the effect of the dropout probability in the convergence rate that depends on the relative size of the width of the NN compared to its depth.
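For readers who want a concrete picture of the training procedure the abstract describes, here is a minimal NumPy sketch of one such dropout algorithm: at every SGD iteration the weight matrices are multiplied componentwise by freshly drawn $\{0,1\}$-valued Bernoulli masks, a gradient step is taken, and the updated weights are projected onto a compact set. The toy two-layer linear network, the squared-error loss, the keep probability p, and the projection radius R are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a two-layer linear NN trained on synthetic data.
d_in, d_hidden, d_out = 8, 16, 1
W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.1, size=(d_out, d_hidden))

X = rng.normal(size=(256, d_in))
y = X @ rng.normal(size=(d_in, d_out))

p = 0.5    # probability that each individual weight is kept in a given iteration
R = 10.0   # radius of the compact (Frobenius-norm ball) set used for projection
lr = 1e-2  # SGD step size

def project(W, radius):
    """Project W onto the Frobenius-norm ball of the given radius."""
    norm = np.linalg.norm(W)
    return W if norm <= radius else W * (radius / norm)

for step in range(2000):
    i = rng.integers(len(X))
    x, t = X[i:i + 1].T, y[i:i + 1].T   # one sample, as column vectors

    # Componentwise {0,1} Bernoulli masks, redrawn independently every iteration.
    M1 = rng.binomial(1, p, size=W1.shape)
    M2 = rng.binomial(1, p, size=W2.shape)

    # Forward pass through the masked (linear-activation) network.
    h = (M1 * W1) @ x
    out = (M2 * W2) @ h
    err = out - t                        # gradient of the squared-error loss w.r.t. out

    # Backward pass: gradients of the loss w.r.t. W1 and W2 under the current masks.
    gW2 = (err @ h.T) * M2
    gW1 = (((M2 * W2).T @ err) @ x.T) * M1

    # SGD step followed by projection onto the compact set.
    W1 = project(W1 - lr * gW1, R)
    W2 = project(W2 - lr * gW2, R)
```

Masking individual weights in this way corresponds to Dropconnect; zeroing entire rows of the masks instead, so that whole hidden units are dropped, recovers standard Dropout. The projection step mirrors the compactness assumption behind the projected system of ODEs discussed in the abstract.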
