Paper Title

Efficient Knowledge Distillation from Model Checkpoints

Paper Authors

Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang

Paper Abstract

Knowledge distillation is an effective approach to learning compact models (students) under the supervision of large and strong models (teachers). Since there is empirically a strong correlation between the performance of teacher and student models, it is commonly believed that a high-performing teacher is preferred. Consequently, practitioners tend to use a well-trained network, or an ensemble of them, as the teacher. In this paper, we make an intriguing observation that an intermediate model, i.e., a checkpoint in the middle of the training procedure, often serves as a better teacher than the fully converged model, even though the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained, fully converged models when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information with the input, and thus contain more "dark knowledge" for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability.
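To make the setup concrete, here is a minimal sketch of distilling a student from an intermediate checkpoint rather than the fully converged model, assuming a standard PyTorch training loop with the usual softened-logit KD loss. The checkpoint path, temperature T, weight alpha, and the build_teacher / build_student / train_loader helpers are illustrative placeholders, not details from the paper; in particular, the paper's mutual-information-based teacher selection is not reproduced here.

# Minimal sketch: knowledge distillation from an intermediate checkpoint.
# All paths, hyperparameters, and model/loader constructors below are
# hypothetical placeholders, not values taken from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Standard KD loss: KL divergence between temperature-softened
    distributions, plus cross-entropy on the ground-truth labels."""
    soft_teacher = F.log_softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                  log_target=True) * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

# Teacher: an intermediate checkpoint saved mid-training, not the final model.
teacher = build_teacher()                              # hypothetical constructor
teacher.load_state_dict(torch.load("checkpoints/epoch_120.pt"))  # hypothetical path
teacher.eval()

student = build_student()                              # hypothetical constructor
optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

for images, targets in train_loader:                   # hypothetical DataLoader
    with torch.no_grad():
        t_logits = teacher(images)                     # frozen teacher predictions
    s_logits = student(images)
    loss = distillation_loss(s_logits, t_logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Scaling the KL term by T * T keeps its gradient magnitude comparable across temperatures, following standard distillation practice.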
