Paper Title

Matching Guided Distillation

Paper Authors

Kaiyu Yue, Jiangfan Deng, Feng Zhou

Paper Abstract

Feature distillation is an effective way to improve the performance of a smaller student model, which has fewer parameters and lower computation cost than the larger teacher model. Unfortunately, there is a common obstacle: the gap in semantic feature structure between the intermediate features of the teacher and the student. The classic scheme prefers to transform intermediate features by adding an adaptation module, such as a naive convolutional, attention-based, or more complicated one. However, this introduces two problems: a) the adaptation module brings more parameters into training; b) an adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student. In this paper, we present Matching Guided Distillation (MGD) as an efficient and parameter-free way to solve these problems. The key idea of MGD is to pose matching the teacher channels with the student's as an assignment problem. We compare three solutions of the assignment problem for reducing channels from the teacher features under a partial distillation loss. The overall training takes a coordinate-descent approach between two optimization objectives: assignment updates and parameter updates. Since MGD only contains normalization or pooling operations with negligible computation cost, it can be flexibly plugged into a network together with other distillation methods.
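
To make the channel-matching idea concrete, below is a minimal sketch that poses teacher-student channel matching as an assignment problem solved with the Hungarian algorithm, then distills only the matched teacher channels with a partial loss. This is an illustrative simplification, not the paper's implementation: the per-channel mean-activation cost, the function names match_channels and partial_distillation_loss, and the plain MSE loss are assumptions made for brevity.

```python
# Minimal sketch of matching-guided channel distillation, assuming PyTorch features
# and SciPy's Hungarian solver; an illustration, not the authors' code.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def match_channels(f_t: torch.Tensor, f_s: torch.Tensor):
    """Assignments update: match each student channel to one teacher channel.

    f_t: teacher feature map (N, C_t, H, W); f_s: student feature map (N, C_s, H, W),
    with C_t >= C_s. The cost here is an L1 distance between per-channel mean
    activations, a simplification chosen for brevity.
    """
    t = f_t.mean(dim=(0, 2, 3))                 # (C_t,) per-channel statistic
    s = f_s.mean(dim=(0, 2, 3))                 # (C_s,)
    cost = (t[:, None] - s[None, :]).abs()      # (C_t, C_s) pairwise matching cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return (torch.as_tensor(rows, device=f_t.device),
            torch.as_tensor(cols, device=f_s.device))


def partial_distillation_loss(f_t, f_s, rows, cols):
    """Parameters update: distill only the matched subset of teacher channels.

    Assumes teacher and student features share the same spatial size;
    otherwise pool or resize one of them first.
    """
    return F.mse_loss(f_s[:, cols], f_t[:, rows].detach())


# Coordinate descent inside a training loop (pseudo-usage):
#   rows, cols = match_channels(f_t, f_s)   # fix parameters, update assignments
#   loss = task_loss + partial_distillation_loss(f_t, f_s, rows, cols)
#   loss.backward(); optimizer.step()       # fix assignments, update parameters
```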
