Paper Title
Neighbourhood Distillation: On the benefits of non end-to-end distillation
Paper Authors
Paper Abstract
End-to-end training with back-propagation is the standard method for training deep neural networks. However, as networks become deeper and larger, end-to-end training becomes more challenging: highly non-convex models easily get stuck in local optima, gradient signals are prone to vanishing or exploding during back-propagation, and training requires substantial computational resources and time. In this work, we propose to break away from the end-to-end paradigm in the context of Knowledge Distillation. Instead of distilling a model end-to-end, we propose to split it into smaller sub-networks, also called neighbourhoods, that are then trained independently. We empirically show that distilling networks in a non end-to-end fashion can be beneficial in a diverse range of use cases. First, we show that it speeds up Knowledge Distillation by exploiting parallelism and training on smaller networks. Second, we show that independently distilled neighbourhoods can be efficiently re-used for Neural Architecture Search. Finally, because smaller networks model simpler functions, we show that they are easier to train with synthetic data than their deeper counterparts.
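To make the neighbourhood idea concrete, here is a minimal sketch of distilling a single neighbourhood, written in PyTorch-style Python under our own assumptions rather than from the authors' code: a frozen teacher prefix produces the block's inputs, the frozen teacher block produces the regression targets, and a smaller student block is fit to match them. The names `distill_neighbourhood`, `teacher_blocks`, and `data_loader` are hypothetical placeholders.

```python
# Minimal sketch of neighbourhood distillation (assumed setup, not the paper's code):
# the teacher is a list of sequential blocks; each block defines one neighbourhood.
import torch
import torch.nn as nn

def distill_neighbourhood(teacher_blocks, block_idx, student_block, data_loader,
                          epochs=1, lr=1e-3, device="cpu"):
    """Train one student sub-network to mimic one teacher neighbourhood.

    The neighbourhood's input comes from running the frozen teacher prefix,
    and its target is the frozen teacher block's own output. Because no other
    block is involved, different block indices can be distilled independently,
    e.g. in parallel on separate workers.
    """
    prefix = nn.Sequential(*teacher_blocks[:block_idx]).to(device).eval()
    teacher_block = teacher_blocks[block_idx].to(device).eval()
    student_block = student_block.to(device).train()

    opt = torch.optim.Adam(student_block.parameters(), lr=lr)
    mse = nn.MSELoss()

    for _ in range(epochs):
        for x, _ in data_loader:                 # labels are not needed
            x = x.to(device)
            with torch.no_grad():                # teacher stays frozen
                h_in = prefix(x)                 # neighbourhood input
                h_target = teacher_block(h_in)   # neighbourhood target
            loss = mse(student_block(h_in), h_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student_block
```

In such a setup, the independently distilled student blocks could afterwards be stitched back into a full network (optionally with a brief end-to-end fine-tune), or mixed and matched across candidate architectures, which is what makes their re-use for Neural Architecture Search attractive.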