Title

Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Authors

Blake Bordelon, Cengiz Pehlevan

Abstract

We analyze feature learning in infinite-width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently output predictions. We show that the field theory derivation recovers the recursive stochastic process of infinite-width feature learning networks obtained by Yang and Hu (2021) with Tensor Programs. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of CNNs at fixed feature learning strength are preserved across different widths on a CIFAR classification task.
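
The "alternating sampling procedure" mentioned in the abstract refers to a general numerical strategy: given a current guess for the kernel order parameters, sample the stochastic fields whose statistics the kernels define, then re-estimate the kernels from the sampled fields, iterating (with damping) until the two are self-consistent. The sketch below illustrates this strategy on a deliberately simplified toy fixed point, K = K_base + E_{u ~ N(0, K)}[φ(u) φ(u)ᵀ], not on the paper's actual DMFT saddle-point equations, whose kernels are indexed over pairs of time points and couple forward and backward fields across layers. The function name `solve_self_consistent_kernel` and the toy self-consistency condition are our own illustrative choices.

```python
# Minimal sketch of a damped Monte Carlo fixed-point solve, illustrating the
# alternating strategy (sample fields given kernels, update kernels given
# fields) on a toy self-consistency condition -- NOT the paper's equations.
import numpy as np

def solve_self_consistent_kernel(K_base, phi=np.tanh, n_samples=200_000,
                                 damping=0.5, tol=5e-3, max_iters=200, seed=0):
    """Solve the toy condition K = K_base + E_{u ~ N(0, K)}[phi(u) phi(u)^T]
    by damped fixed-point iteration with Monte Carlo expectations."""
    rng = np.random.default_rng(seed)
    K = K_base.copy()
    p = K.shape[0]
    residual = np.inf
    for _ in range(max_iters):
        # Sample fields u ~ N(0, K) via a Cholesky factor (jitter for stability).
        L = np.linalg.cholesky(K + 1e-9 * np.eye(p))
        u = rng.standard_normal((n_samples, p)) @ L.T
        # Re-estimate the kernel from the sampled nonlinear fields.
        K_new = K_base + phi(u).T @ phi(u) / n_samples
        # The residual measures how far we are from self-consistency;
        # the damped update stabilizes the alternating scheme.
        residual = np.max(np.abs(K_new - K))
        K = (1.0 - damping) * K + damping * K_new
        if residual < tol:
            break
    return K, residual

if __name__ == "__main__":
    K_base = np.array([[1.0, 0.3],
                       [0.3, 1.0]])
    K, res = solve_self_consistent_kernel(K_base)
    print("self-consistent kernel:\n", K)
    print("final residual:", res)
```

In the paper's setting the role of K is played by the feature and gradient kernels evaluated at pairs of training times, so each "sampling" step draws entire field trajectories rather than static Gaussian vectors; damping plays the same stabilizing role in that fixed-point iteration as in this toy version. Note that the achievable residual is limited by Monte Carlo noise, which shrinks as n_samples grows.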
