Paper Title

Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

Paper Authors

Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli

Paper Abstract

In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK. We do so through a large-scale phenomenological analysis of training, synthesizing diverse measures characterizing loss landscape geometry and NTK dynamics. In multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process. In this picture, deep network training exhibits a highly chaotic rapid initial transient that within 2 to 3 epochs determines the final linearly connected basin of low loss containing the end point of training. During this chaotic transient, the NTK changes rapidly, learning useful features from the training data that enable it to outperform the standard initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid chaotic transient, the NTK changes at constant velocity, and its performance matches that of full network training in 15% to 45% of training time. Overall, our analysis reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, that together pose challenges and opportunities for the development of more accurate theories of deep learning.
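To make the abstract's central objects concrete, below is a minimal JAX sketch, not the authors' code: the tiny MLP, toy data, learning rate, Frobenius-norm "kernel velocity" metric, and 11-point interpolation grid are all illustrative assumptions. It shows the linear weight expansion around initialization, the empirical (data-dependent) NTK, and the loss along a straight line in weight space, the kind of probe used to test whether two points share a linearly connected low-loss basin.

```python
# Minimal sketch (illustrative assumptions throughout; see lead-in above).
import jax
import jax.numpy as jnp

def mlp(params, x):
    # Tiny two-layer network with scalar output, standing in for the
    # paper's much larger architectures.
    w1, b1, w2, b2 = params
    h = jnp.tanh(x @ w1 + b1)
    return (h @ w2 + b2).squeeze(-1)

key = jax.random.PRNGKey(0)
k1, k2, kx = jax.random.split(key, 3)
d, width, n = 8, 32, 16
params0 = (jax.random.normal(k1, (d, width)) / jnp.sqrt(d),
           jnp.zeros(width),
           jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
           jnp.zeros(1))
x = jax.random.normal(kx, (n, d))
y = jnp.sin(x[:, 0])  # toy regression targets (assumption)

def f_lin(params, x):
    # Linearized model: f_lin(x; θ) = f(x; θ0) + ∇θ f(x; θ0) · (θ − θ0).
    delta = jax.tree_util.tree_map(lambda a, b: a - b, params, params0)
    f0, df = jax.jvp(lambda p: mlp(p, x), (params0,), (delta,))
    return f0 + df

def empirical_ntk(params, x):
    # K_ij = ∇θ f(x_i; θ) · ∇θ f(x_j; θ), from the flattened Jacobian.
    jac = jax.jacobian(lambda p: mlp(p, x))(params)
    rows = [j.reshape(x.shape[0], -1) for j in jax.tree_util.tree_leaves(jac)]
    flat = jnp.concatenate(rows, axis=1)
    return flat @ flat.T

def loss(params, x, y):
    return jnp.mean((mlp(params, x) - y) ** 2)

# One SGD step; the per-step change of K ("kernel velocity") is the kind of
# quantity tracked over training time.
grads = jax.grad(loss)(params0, x, y)
params1 = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params0, grads)
K0, K1 = empirical_ntk(params0, x), empirical_ntk(params1, x)
print("kernel velocity:", float(jnp.linalg.norm(K1 - K0)))

# The linearization matches the network exactly at θ0 and approximately nearby.
print("max |f - f_lin| after one step:",
      float(jnp.max(jnp.abs(mlp(params1, x) - f_lin(params1, x)))))

def linear_path_loss(params_a, params_b, steps=11):
    # Loss along θ(α) = (1 − α) θa + α θb; a profile with no barrier means
    # the endpoints lie in the same linearly connected low-loss basin.
    interp = lambda a: jax.tree_util.tree_map(
        lambda p, q: (1.0 - a) * p + a * q, params_a, params_b)
    return jnp.stack([loss(interp(a), x, y)
                      for a in jnp.linspace(0.0, 1.0, steps)])

print("path loss:", linear_path_loss(params0, params1))
```

In the abstract's picture, the kernel velocity is large during the first 2 to 3 epochs and roughly constant afterward, and endpoints reached after the chaotic transient share a barrier-free linear path of low loss.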
