Paper Title
Making Coherence Out of Nothing At All: Measuring the Evolution of Gradient Alignment
Paper Authors
Paper Abstract
We propose a new metric ($m$-coherence) to experimentally study the alignment of per-example gradients during training. Intuitively, given a sample of size $m$, $m$-coherence is the number of examples in the sample that, on average, benefit from a small step along the gradient of any one example. We show that compared to other commonly used metrics, $m$-coherence is more interpretable, cheaper to compute ($O(m)$ instead of $O(m^2)$) and mathematically cleaner. (We note that $m$-coherence is closely connected to gradient diversity, a quantity previously used in some theoretical bounds.) Using $m$-coherence, we study the evolution of alignment of per-example gradients in ResNet and Inception models on ImageNet and several variants with label noise, particularly from the perspective of the recently proposed Coherent Gradients (CG) theory, which provides a simple, unified explanation for memorization and generalization [Chatterjee, ICLR 20]. Although we have several interesting takeaways, our most surprising result concerns memorization. Naively, one might expect that when training with completely random labels, each example is fitted independently, and so $m$-coherence should be close to 1. However, this is not the case: $m$-coherence reaches much higher values during training (in the 100s), indicating that over-parameterized neural networks find common patterns even in scenarios where generalization is not possible. A detailed analysis of this phenomenon provides a deeper confirmation of CG but, at the same time, puts into sharp relief what is missing from the theory in order to provide a complete explanation of generalization in neural networks.
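As a concrete illustration of the definition, below is a minimal sketch of how such a metric can be computed, assuming the natural normalization $\alpha_m = \|\sum_i g_i\|^2 / \sum_i \|g_i\|^2$ over per-example gradients $g_1, \dots, g_m$ (the paper's exact formula may differ in details). Under this choice, $\alpha_m$ is 1 when the gradients are mutually orthogonal and $m$ when they all coincide, its reciprocal is, up to normalization, the gradient diversity quantity mentioned above, and it needs only a single pass over the $m$ gradients, hence $O(m)$ rather than the $O(m^2)$ cost of all pairwise inner products.

```python
# Minimal sketch of an m-coherence computation, assuming the normalization
# alpha_m = ||sum_i g_i||^2 / sum_i ||g_i||^2 (the paper's exact formula
# may differ). With this choice, alpha_m = 1 for mutually orthogonal
# per-example gradients and alpha_m = m when all m gradients coincide.
import numpy as np

def m_coherence(per_example_grads):
    """per_example_grads: array of shape (m, d), one flattened gradient
    per example. Runs in O(m * d), i.e. O(m) in the sample size, versus
    O(m^2) for computing all pairwise dot products."""
    g = np.asarray(per_example_grads, dtype=np.float64)
    total = g.sum(axis=0)              # sum of the m per-example gradients
    num = np.dot(total, total)         # ||sum_i g_i||^2
    den = np.einsum('ij,ij->', g, g)   # sum_i ||g_i||^2
    return num / den

# Toy check: identical gradients give alpha_m = m; orthogonal ones give 1.
m, d = 100, 512
rng = np.random.default_rng(0)
identical = np.tile(rng.normal(size=d), (m, 1))
print(m_coherence(identical))          # ~100.0
orthogonal = np.eye(m, d)              # m mutually orthogonal unit vectors
print(m_coherence(orthogonal))         # 1.0
```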