Paper Title
Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay
Paper Authors
Paper Abstract
Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data. Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store and rehearse the generated samples periodically, which increases the memory footprint and creates privacy concerns. We propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.
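The abstract outlines the core loop: the student is distilled from the teacher on synthetic samples, while a VAE learns the distribution of those samples so the student can be rehearsed on pseudo-replayed data rather than on a stored buffer. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the model architectures, temperature, loss weighting, and the standard ELBO objective for the VAE are illustrative assumptions (the paper uses a customized VAE objective), and training of the synthetic-sample generator itself is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM, LATENT, NUM_CLASSES, T = 3 * 32 * 32, 64, 10, 4.0  # assumed sizes

class VAE(nn.Module):
    """Models the distribution of previously observed synthetic samples."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU())
        self.mu = nn.Linear(256, LATENT)
        self.logvar = nn.Linear(256, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, IMG_DIM), nn.Tanh())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

    def sample(self, n):
        return self.dec(torch.randn(n, LATENT))  # pseudo-replay samples

def kd_loss(student_logits, teacher_logits):
    """Temperature-scaled KL divergence used for distillation."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T

def distill_step(teacher, student, generator, vae, opt_student, opt_vae, batch=128):
    # Fresh synthetic batch from the sample generator; how the generator itself
    # is trained is outside the scope of this sketch.
    x_new = generator(torch.randn(batch, LATENT)).detach()
    # Pseudo-replayed batch drawn from the VAE instead of a stored sample buffer.
    x_old = vae.sample(batch).detach()

    # Distill on both current and replayed samples to counter knowledge
    # degradation caused by the shifting synthetic-data distribution.
    loss_student = kd_loss(student(x_new), teacher(x_new).detach()) + \
                   kd_loss(student(x_old), teacher(x_old).detach())
    opt_student.zero_grad(); loss_student.backward(); opt_student.step()

    # Keep the VAE modelling the stream of synthetic samples (standard ELBO
    # here; the paper customizes this objective).
    recon, mu, logvar = vae(x_new)
    loss_vae = F.mse_loss(recon, x_new) - \
               0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    opt_vae.zero_grad(); loss_vae.backward(); opt_vae.step()

if __name__ == "__main__":
    # Hypothetical stand-ins for the teacher, student, and sample generator.
    def mlp(out_dim):
        return nn.Sequential(nn.Linear(IMG_DIM, 128), nn.ReLU(),
                             nn.Linear(128, out_dim))
    teacher, student = mlp(NUM_CLASSES).eval(), mlp(NUM_CLASSES)
    generator = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                              nn.Linear(256, IMG_DIM), nn.Tanh())
    vae = VAE()
    opt_student = torch.optim.Adam(student.parameters(), lr=1e-3)
    opt_vae = torch.optim.Adam(vae.parameters(), lr=1e-3)
    distill_step(teacher, student, generator, vae, opt_student, opt_vae)
```

Because the rehearsal batch is drawn from the VAE rather than from retained samples, the memory footprint stays constant and no synthetic data needs to be stored, which is the trade-off the abstract highlights against sample-storing baselines.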