Paper Title


Are All Losses Created Equal: A Neural Collapse Perspective

Authors

Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, Zhihui Zhu

Abstract


While cross entropy (CE) is the most commonly used loss to train deep neural networks for classification tasks, many alternative losses have been developed to obtain better empirical performance. Among them, which one is best to use remains a mystery, because there seem to be multiple factors affecting the answer, such as the properties of the dataset, the choice of network architecture, and so on. This paper studies the choice of loss function by examining the last-layer features of deep networks, drawing inspiration from a recent line of work showing that the global optimal solutions of the CE and mean-squared-error (MSE) losses exhibit the Neural Collapse phenomenon. That is, for sufficiently large networks trained until convergence, (i) all features of the same class collapse to the corresponding class mean, and (ii) the means associated with different classes are in a configuration where their pairwise distances are all equal and maximized. We extend these results and show, through global solution and landscape analyses, that a broad family of loss functions, including the commonly used label smoothing (LS) and focal loss (FL), exhibits Neural Collapse. Hence, all relevant losses (i.e., CE, LS, FL, MSE) produce equivalent features on training data. Based on the unconstrained feature model assumption, we provide a global landscape analysis for the LS loss and a local landscape analysis for the FL loss, and show that the (only!) global minimizers are Neural Collapse solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions, globally for the LS loss and locally near the optimal solution for the FL loss. The experiments further show that the Neural Collapse features obtained from all relevant losses lead to largely identical performance on test data as well, provided that the network is sufficiently large and trained until convergence.
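
For readers who want the quantities in the abstract made concrete, below is a minimal PyTorch sketch, not taken from the paper, of the four losses it names and of the two Neural Collapse properties (i) and (ii). The function names, the focal exponent gamma=2.0, the smoothing factor 0.1, and the particular MSE formulation (raw network outputs against one-hot targets) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0):
    # FL down-weights well-classified examples: -(1 - p_t)^gamma * log p_t.
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()


def the_four_losses(logits, targets, num_classes):
    one_hot = F.one_hot(targets, num_classes).float()
    return {
        "CE": F.cross_entropy(logits, targets),
        "LS": F.cross_entropy(logits, targets, label_smoothing=0.1),  # smoothing factor assumed
        "FL": focal_loss(logits, targets),
        "MSE": F.mse_loss(logits, one_hot),  # one common formulation: outputs vs. one-hot targets
    }


def neural_collapse_metrics(features, targets, num_classes):
    # (i)  NC1: within-class variation relative to the spread of the class
    #      means; collapse drives this ratio toward 0.
    # (ii) NC2: pairwise cosines between centered class means; a simplex
    #      equiangular tight frame has all cosines equal to -1/(K-1).
    means = torch.stack([features[targets == k].mean(dim=0) for k in range(num_classes)])
    centered = means - means.mean(dim=0)
    within = torch.cat([features[targets == k] - means[k] for k in range(num_classes)])
    nc1 = within.pow(2).sum(dim=1).mean() / centered.pow(2).sum(dim=1).mean()
    unit = F.normalize(centered, dim=1)
    cosines = (unit @ unit.T)[~torch.eye(num_classes, dtype=torch.bool)]
    return nc1.item(), cosines
```

For example, on the last-layer features of a converged network with K = 4 classes, the abstract's claim corresponds to `nc1` being close to 0 and all entries of `cosines` being close to -1/3, regardless of which of the four losses was used for training.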
