Paper Title

Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature Learning and Lazy Training

Paper Authors

Mario Geiger, Leonardo Petrini, Matthieu Wyart

Paper Abstract

Deep learning algorithms are responsible for a technological revolution in a variety of tasks including image recognition or Go playing. Yet, why they work is not understood. Ultimately, they manage to classify data lying in high dimension -- a feat generically impossible due to the geometry of high-dimensional space and the associated curse of dimensionality. Understanding what kind of structure, symmetry or invariance makes data such as images learnable is a fundamental challenge. Other puzzles include that (i) learning corresponds to minimizing a loss in high dimension, which is in general not convex and could well get stuck in bad minima, and (ii) deep learning predictive power increases with the number of fitting parameters, even in a regime where the data are perfectly fitted. In this manuscript, we review recent results elucidating (i) and (ii) and the perspective they offer on the (still unexplained) curse-of-dimensionality paradox. We base our theoretical discussion on the $(h,\alpha)$ plane, where $h$ is the network width and $\alpha$ the scale of the output of the network at initialization, and provide new systematic measures of performance in that plane for MNIST and CIFAR 10. We argue that the different learning regimes can be organized into a phase diagram. A line of critical points sharply delimits an under-parametrized phase from an over-parametrized one. In over-parametrized nets, learning can operate in two regimes separated by a smooth cross-over. At large initialization, it corresponds to a kernel method, whereas for small initializations features can be learnt, together with invariants in the data. We review the properties of these different phases, of the transition separating them, and some open questions. Our treatment emphasizes analogies with physical systems, scaling arguments, and the development of numerical observables to quantitatively test these results empirically.
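
As an illustration of the $(h,\alpha)$ parametrization discussed in the abstract, the sketch below trains a small fully connected network whose predictor is rescaled by $\alpha$, following a common lazy-training convention $f(x)=\alpha\,(f_w(x)-f_{w_0}(x))$. This is a minimal sketch, not the authors' code: the architecture, the toy data, the learning-rate scaling ($\mathrm{lr}\propto 1/\alpha^{2}$) and the particular values of $h$ and $\alpha$ are illustrative assumptions. At large $\alpha$ the weights are expected to stay close to their initialization (kernel/lazy regime), while at small $\alpha$ they must move substantially, which corresponds to the feature-learning regime.

```python
# Minimal sketch (illustrative assumptions, not the paper's setup) of the
# alpha-rescaled predictor f(x) = alpha * (net_w(x) - net_{w0}(x)) commonly
# used to interpolate between the lazy (kernel) and feature-learning regimes.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


def make_net(d_in=10, h=128):
    # Fully connected net of width h (the width axis of the (h, alpha) plane).
    return nn.Sequential(nn.Linear(d_in, h), nn.ReLU(), nn.Linear(h, 1))


def train(alpha, h=128, steps=1000, lr0=0.05):
    torch.manual_seed(0)
    x = torch.randn(256, 10)
    y = torch.sign(x[:, 0:1])              # toy +/-1 labels
    net = make_net(h=h)
    net0 = copy.deepcopy(net)              # frozen copy at initialization
    with torch.no_grad():
        f0 = net0(x)                       # network output at initialization
    # Rescale the learning rate by 1/alpha^2 so the function-space dynamics
    # proceed at a comparable pace for all alpha (a common convention).
    opt = torch.optim.SGD(net.parameters(), lr=lr0 / alpha ** 2)
    for _ in range(steps):
        opt.zero_grad()
        f = alpha * (net(x) - f0)          # alpha-rescaled predictor
        loss = F.softplus(-y * f).mean()   # smooth margin-based loss
        loss.backward()
        opt.step()
    with torch.no_grad():
        dw2 = sum(((p - p0) ** 2).sum()
                  for p, p0 in zip(net.parameters(), net0.parameters()))
    return loss.item(), dw2.sqrt().item()


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        loss, dw = train(alpha)
        # Expectation: ||w - w0|| shrinks as alpha grows (lazy regime)
        # and grows as alpha shrinks (feature-learning regime).
        print(f"alpha={alpha:g}  loss={loss:.3f}  ||w - w0||={dw:.3f}")
```

Comparing the reported $\lVert w - w_0\rVert$ across the $\alpha$ values gives a quick, qualitative handle on the lazy-to-feature cross-over; the systematic measurements on MNIST and CIFAR 10 mentioned in the abstract play this role quantitatively in the paper.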
