Paper Title
Making Coherence Out of Nothing At All: Measuring the Evolution of Gradient Alignment
Paper Authors
Paper Abstract
We propose a new metric ($m$-coherence) to experimentally study the alignment of per-example gradients during training. Intuitively, given a sample of size $m$, $m$-coherence is the number of examples in the sample that, on average, benefit from a small step along the gradient of any one example. We show that compared to other commonly used metrics, $m$-coherence is more interpretable, cheaper to compute ($O(m)$ instead of $O(m^2)$) and mathematically cleaner. (We note that $m$-coherence is closely connected to gradient diversity, a quantity previously used in some theoretical bounds.) Using $m$-coherence, we study the evolution of alignment of per-example gradients in ResNet and Inception models on ImageNet and several variants with label noise, particularly from the perspective of the recently proposed Coherent Gradients (CG) theory, which provides a simple, unified explanation for memorization and generalization [Chatterjee, ICLR 20]. Although we have several interesting takeaways, our most surprising result concerns memorization. Naively, one might expect that when training with completely random labels, each example is fitted independently, and so $m$-coherence should be close to 1. However, this is not the case: $m$-coherence reaches much higher values during training (in the 100s), indicating that over-parameterized neural networks find common patterns even in scenarios where generalization is not possible. A detailed analysis of this phenomenon provides a deeper confirmation of CG but, at the same time, puts into sharp relief what is missing from the theory in order to provide a complete explanation of generalization in neural networks.
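As a concrete illustration of the definition, below is a minimal sketch of how such a metric can be computed, assuming the natural normalization $\alpha_m = \|\sum_i g_i\|^2 / \sum_i \|g_i\|^2$ over per-example gradients $g_1, \dots, g_m$ (the paper's exact formula may differ in details). Under this choice, $\alpha_m$ is 1 when the gradients are mutually orthogonal and $m$ when they all coincide, its reciprocal is, up to normalization, the gradient diversity quantity mentioned above, and it needs only a single pass over the $m$ gradients, hence $O(m)$ rather than the $O(m^2)$ cost of all pairwise inner products.

```python
# Minimal sketch of an m-coherence computation, assuming the normalization
# alpha_m = ||sum_i g_i||^2 / sum_i ||g_i||^2 (the paper's exact formula
# may differ). With this choice, alpha_m = 1 for mutually orthogonal
# per-example gradients and alpha_m = m when all m gradients coincide.
import numpy as np

def m_coherence(per_example_grads):
    """per_example_grads: array of shape (m, d), one flattened gradient
    per example. Runs in O(m * d), i.e. O(m) in the sample size, versus
    O(m^2) for computing all pairwise dot products."""
    g = np.asarray(per_example_grads, dtype=np.float64)
    total = g.sum(axis=0)              # sum of the m per-example gradients
    num = np.dot(total, total)         # ||sum_i g_i||^2
    den = np.einsum('ij,ij->', g, g)   # sum_i ||g_i||^2
    return num / den

# Toy check: identical gradients give alpha_m = m; orthogonal ones give 1.
m, d = 100, 512
rng = np.random.default_rng(0)
identical = np.tile(rng.normal(size=d), (m, 1))
print(m_coherence(identical))          # ~100.0
orthogonal = np.eye(m, d)              # m mutually orthogonal unit vectors
print(m_coherence(orthogonal))         # 1.0
```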