Paper title
Random orthogonal additive filters: a solution to the vanishing/exploding gradient of deep neural networks
Paper authors
Paper abstract
Since the recognition in the early nineties of the vanishing/exploding (V/E) gradient issue plaguing the training of neural networks (NNs), significant efforts have been exerted to overcome this obstacle. However, a clear solution to the V/E issue has remained elusive. In this manuscript a new NN architecture is proposed, designed to mathematically prevent the V/E issue from occurring. The pursuit of approximate dynamical isometry, i.e. parameter configurations where the singular values of the input-output Jacobian are tightly distributed around 1, leads to the derivation of an NN architecture that shares common traits with the popular Residual Network model. Instead of skipping connections between layers, the idea is to filter the previous activations orthogonally and add them to the nonlinear activations of the next layer, realising a convex combination between the two. Remarkably, the impossibility of the gradient updates either vanishing or exploding is demonstrated with analytical bounds that hold even in the infinite-depth case. The effectiveness of this method is demonstrated empirically by training, via backpropagation, an extremely deep multilayer perceptron of 50k layers, and an Elman NN that learns long-term dependencies on inputs 10k time steps in the past. Compared with other architectures specifically devised to deal with the V/E problem, e.g. LSTMs for recurrent NNs, the proposed model is far simpler yet more effective. Surprisingly, a single-layer vanilla RNN can be enhanced to reach state-of-the-art performance while converging extremely fast; for instance, on the psMNIST task it is possible to obtain test accuracy of over 94% in the first epoch, and over 98% after just 10 epochs.
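The abstract describes the layer update only in words (orthogonally filter the previous activations and combine them convexly with the next layer's nonlinear activations). The following is a minimal, hypothetical PyTorch sketch of that idea, assuming an update of the form h_next = (1 - a) * Q h + a * phi(W h + b) with Q a fixed random orthogonal matrix; the class name, the symbols a, Q, W, b, and the choice of tanh are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch of a "random orthogonal additive filter" layer as understood
# from the abstract; details (coefficient, nonlinearity, trainability of Q)
# are assumptions.
import torch
import torch.nn as nn

class OrthogonalAdditiveLayer(nn.Module):
    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        # Fixed random orthogonal filter applied to the previous activations.
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.register_buffer("Q", q)
        self.linear = nn.Linear(dim, dim)
        self.alpha = alpha  # convex-combination coefficient, assumed in (0, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Convex combination of the orthogonally filtered previous activations
        # and the nonlinear activations of the current layer.
        return (1.0 - self.alpha) * (h @ self.Q.T) + self.alpha * torch.tanh(self.linear(h))
```

Because Q is orthogonal and the two branches are mixed convexly, the linear path preserves the norm of the signal it carries, which is the intuition the abstract gives for why the input-output Jacobian's singular values stay near 1.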