Paper Title
Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization
Paper Authors
Paper Abstract
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with the smoothed version of this objective and making it more complex as the training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
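Since the abstract does not spell out the exact smoothing schedule, the following is a minimal, hypothetical sketch of the continuation idea in PyTorch: the KD objective starts from a heavily smoothed form (high distillation temperature, no hard-label term) and is gradually annealed toward the full, less smooth objective as training proceeds. The function name and the hyperparameters (T_start, T_end, alpha_end) are illustrative assumptions, not the paper's formulation of Continuation-KD.

```python
import torch
import torch.nn.functional as F

def continuation_kd_loss(student_logits, teacher_logits, labels,
                         step, total_steps,
                         T_start=8.0, T_end=1.0, alpha_end=0.5):
    """Illustrative continuation-style KD loss (not the paper's exact method).

    Early in training the objective is heavily smoothed: a high distillation
    temperature flattens the teacher's distribution and the hard-label term
    is down-weighted, which suppresses the teacher's noisy, sharp behaviour.
    As training proceeds, the temperature is annealed toward 1 and the
    hard-label cross-entropy is mixed back in, recovering the full
    (more complex) KD objective.
    """
    progress = step / max(total_steps, 1)          # training progress in [0, 1]
    T = T_start + (T_end - T_start) * progress     # anneal the temperature
    alpha = alpha_end * progress                   # grow the hard-label weight

    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 scaling keeps gradient magnitudes comparable across temperatures,
    # as is standard in temperature-based distillation.
    kd_term = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, labels)

    return (1.0 - alpha) * kd_term + alpha * ce_term
```

Under this sketch, the loss surface seen by the student is smoothest at the start of training and gradually approaches the standard KD objective, mirroring the continuation-optimization strategy described in the abstract.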