Paper Title
What Makes a "Good" Data Augmentation in Knowledge Distillation -- A Statistical Perspective
Paper Authors
Paper Abstract
Knowledge distillation (KD) is a general neural network training approach that uses a teacher model to guide the student model. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) has not been well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the covariance of the teacher-student cross-entropy. A practical metric, the stddev of the teacher's mean probability (T. stddev), is further presented and justified empirically. Beyond the theoretical understanding, we also introduce a new entropy-based data-mixing DA scheme, CutMixPick, to further enhance CutMix. Extensive empirical studies support our claims and demonstrate how considerable performance gains can be harvested simply by using a better DA scheme in knowledge distillation.
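Below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the two ideas named in the abstract: a "T. stddev"-style metric, read here as the standard deviation, across repeated random augmentations of a dataset, of the teacher's mean predicted probability on the ground-truth class, and an entropy-based sample-picking step in the spirit of CutMixPick. The names `teacher`, `loader`, `augment`, `num_trials`, and `pick_ratio` are illustrative assumptions, not definitions from the paper.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def t_stddev(teacher, loader, augment, num_trials=10, device="cuda"):
    """Assumed reading of T. stddev: std dev over augmentation trials of the
    teacher's mean ground-truth-class probability (lower = "better" DA for KD)."""
    teacher.eval()
    trial_means = []
    for _ in range(num_trials):
        probs = []
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            aug_images = augment(images)  # e.g., CutMix, Cutout, Flip+Crop, etc.
            p = F.softmax(teacher(aug_images), dim=1)
            probs.append(p.gather(1, labels.unsqueeze(1)).squeeze(1))
        trial_means.append(torch.cat(probs).mean())
    return torch.stack(trial_means).std().item()


@torch.no_grad()
def pick_by_entropy(teacher, cutmix_batch, pick_ratio=0.5):
    """Entropy-based selection in the spirit of CutMixPick: keep the CutMix-ed
    samples whose teacher predictions have the highest entropy."""
    p = F.softmax(teacher(cutmix_batch), dim=1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
    k = max(1, int(pick_ratio * cutmix_batch.size(0)))
    idx = entropy.topk(k).indices
    return cutmix_batch[idx], idx
```

As a usage sketch, one would compare `t_stddev(teacher, loader, aug)` across candidate DA schemes and prefer the one with the smaller value, and apply `pick_by_entropy` to each CutMix-ed batch before computing the KD loss; the exact definitions and hyperparameters should be taken from the paper itself.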