Paper Title
A Framework for Overparameterized Learning
Paper Authors
Paper Abstract
A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.
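To illustrate the paper's one non-mild assumption, the sketch below (not the authors' code; depth, width, sample size, and the unit-sphere data distribution are illustrative assumptions) computes the empirical neural tangent kernel of a randomly initialized ReLU multilayer perceptron in JAX and reports its smallest eigenvalue, the quantity the paper assumes concentrates away from zero at initialization.

```python
# Minimal sketch: smallest eigenvalue of the empirical NTK of a wide ReLU MLP
# at random initialization. All hyperparameters below are illustrative choices,
# not values from the paper.
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    """Gaussian weights scaled by 1/sqrt(fan-in) for each layer."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_out, d_in)) / jnp.sqrt(d_in))
    return params

def mlp(params, x):
    """Forward pass of a ReLU MLP with a scalar output."""
    for w in params[:-1]:
        x = jax.nn.relu(w @ x)
    return (params[-1] @ x)[0]

def empirical_ntk(params, xs):
    """K_ij = <grad_theta f(x_i), grad_theta f(x_j)> at the given parameters."""
    grads = jax.vmap(lambda x: jax.grad(mlp)(params, x))(xs)  # per-example gradients
    flat = jnp.concatenate([g.reshape(xs.shape[0], -1) for g in grads], axis=1)
    return flat @ flat.T

key = jax.random.PRNGKey(0)
key_x, key_w = jax.random.split(key)
n, d, width, depth = 64, 10, 512, 3          # assumed, for illustration only
xs = jax.random.normal(key_x, (n, d))
xs = xs / jnp.linalg.norm(xs, axis=1, keepdims=True)  # unit-norm inputs
params = init_mlp(key_w, [d] + [width] * depth + [1])
K = empirical_ntk(params, xs)
print("smallest NTK eigenvalue at initialization:", jnp.linalg.eigvalsh(K)[0])
```

Repeating this over several random seeds and increasing widths gives a quick empirical check of whether the smallest eigenvalue stays bounded away from zero as the network gets wider, in the spirit of the empirical evidence the abstract mentions.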