Paper Title
Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Paper Authors
Paper Abstract
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. They also often pay less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and lack a systematic study to show the effectiveness of their methods. In this paper, we perform a comprehensive, systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
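To make the ternary quantization mentioned above concrete, the sketch below ternarizes a weight tensor to {-alpha, 0, +alpha} using the common Ternary Weight Networks heuristic (threshold = 0.7 × mean |w|). This is an illustrative example of ternary quantization in general, not the paper's actual XTC quantizer; the function name and threshold constant are assumptions for illustration.

```python
import numpy as np

def ternarize(w):
    """Illustrative ternary quantization: map weights to {-1, 0, +1} * alpha.

    Threshold and scale follow the common TWN-style heuristic
    (delta = 0.7 * mean|w|); NOT the paper's actual quantizer.
    """
    delta = 0.7 * np.mean(np.abs(w))           # ternarization threshold
    mask = np.abs(w) > delta                   # weights that stay nonzero
    codes = np.sign(w) * mask                  # ternary codes in {-1, 0, +1}
    # Per-tensor scale: mean magnitude of the surviving weights.
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * codes

# Example: each entry of the result takes one of at most three values.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
wq = ternarize(w)
print(np.unique(np.round(np.abs(wq), 6)))  # at most two magnitudes: 0 and alpha
```

With only three representable values per weight, each weight needs under 2 bits instead of 32, which is the source of the large size reductions (e.g., the 50x the abstract reports when combined with layer reduction).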