Paper Title
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
Paper Authors
Paper Abstract
When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
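The LP-FT recipe described above is simple to implement. Below is a minimal PyTorch sketch of the two-step procedure (linear probing, then full fine-tuning); the backbone, optimizer, learning rates, and epoch counts are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch of LP-FT: (1) linear probing -- freeze the pretrained backbone
# and train only the new linear head; (2) full fine-tuning -- unfreeze all
# parameters and continue training end to end from the linear-probed head.
# Hyperparameters here (learning rates, epochs, SGD) are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

def lp_ft(train_loader, num_classes, lp_epochs=5, ft_epochs=5, device="cpu"):
    # Pretrained backbone with a fresh linear head for the downstream task.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model.to(device)
    criterion = nn.CrossEntropyLoss()

    # Step 1: linear probing -- only the head receives gradients,
    # so the pretrained features are left untouched.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    head_opt = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    for _ in range(lp_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            head_opt.zero_grad()
            criterion(model(x), y).backward()
            head_opt.step()

    # Step 2: full fine-tuning starting from the linear-probed head,
    # typically with a smaller learning rate for all parameters.
    for p in model.parameters():
        p.requires_grad = True
    full_opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(ft_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            full_opt.zero_grad()
            criterion(model(x), y).backward()
            full_opt.step()
    return model
```

The key design point, per the abstract, is that the head is no longer random when full fine-tuning begins, so the early fine-tuning gradients distort the pretrained lower-layer features less.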