Paper Title
Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Paper Authors
Paper Abstract
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. They also often pay less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and lack a systematic study to show the effectiveness of their methods. In this paper, we perform a comprehensive, systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
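To make the ternary quantization mentioned above concrete, the sketch below ternarizes a weight tensor to {-alpha, 0, +alpha} using the common Ternary Weight Networks heuristic (threshold = 0.7 × mean |w|). This is an illustrative example of ternary quantization in general, not the paper's actual XTC quantizer; the function name and threshold constant are assumptions for illustration.

```python
import numpy as np

def ternarize(w):
    """Illustrative ternary quantization: map weights to {-1, 0, +1} * alpha.

    Threshold and scale follow the common TWN-style heuristic
    (delta = 0.7 * mean|w|); NOT the paper's actual quantizer.
    """
    delta = 0.7 * np.mean(np.abs(w))           # ternarization threshold
    mask = np.abs(w) > delta                   # weights that stay nonzero
    codes = np.sign(w) * mask                  # ternary codes in {-1, 0, +1}
    # Per-tensor scale: mean magnitude of the surviving weights.
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * codes

# Example: each entry of the result takes one of at most three values.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
wq = ternarize(w)
print(np.unique(np.round(np.abs(wq), 6)))  # at most two magnitudes: 0 and alpha
```

With only three representable values per weight, each weight needs under 2 bits instead of 32, which is the source of the large size reductions (e.g., the 50x the abstract reports when combined with layer reduction).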