Paper Title
Towards Theoretically Inspired Neural Initialization Optimization
Paper Authors
Paper Abstract
Automated machine learning has been widely explored to reduce human efforts in designing neural architectures and looking for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint. Based on this observation, we further propose the neural initialization optimization (NIO) algorithm. Generalized from the sample-wise analysis to the real batch setting, NIO is able to automatically look for a better initialization with negligible cost compared with the training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help train large vision Transformers without warmup.
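To make the quantity concrete, the sketch below computes GradCosine as described in the abstract: the cosine similarity of sample-wise gradients with respect to the initialized parameters, here averaged over all pairs of samples in a batch. The toy model, the cross-entropy loss, and the off-diagonal averaging are illustrative assumptions; the abstract does not specify the authors' exact implementation, and the full NIO algorithm (constrained maximization over initializations) is not reproduced here.

```python
# Minimal sketch of the GradCosine quantity, assuming average pairwise
# cosine similarity of per-sample gradients at initialization.
import torch
import torch.nn as nn
import torch.nn.functional as F


def grad_cosine(model, inputs, targets):
    """Average pairwise cosine similarity of per-sample gradients."""
    per_sample_grads = []
    for x, y in zip(inputs, targets):
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, list(model.parameters()))
        per_sample_grads.append(torch.cat([g.flatten() for g in grads]))
    G = torch.stack(per_sample_grads)        # (N, num_params)
    G = F.normalize(G, dim=1)                # unit-normalize each gradient
    sim = G @ G.t()                          # pairwise cosine similarities
    n = sim.size(0)
    return (sim.sum() - n) / (n * (n - 1))   # mean over off-diagonal pairs


if __name__ == "__main__":
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    x = torch.randn(8, 32)                   # hypothetical toy batch
    y = torch.randint(0, 10, (8,))
    print("GradCosine at initialization:", grad_cosine(net, x, y).item())
```

Under this reading, a larger GradCosine at initialization means the per-sample gradients point in more similar directions, which is the signal NIO maximizes (subject to a gradient norm constraint) to select a better initial state.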