Title
MKQ-BERT: Quantized BERT with 4-bits Weights and Activations
Authors
Abstract
Recently, pre-trained Transformer-based language models, such as BERT, have shown great superiority over traditional methods in many Natural Language Processing (NLP) tasks. However, the computational cost of deploying these models is prohibitive on resource-constrained devices. One way to alleviate this overhead is to quantize the original model into a lower-bit representation, and previous work has shown that both the weights and activations of BERT can be quantized down to 8 bits without degrading its performance. In this work, we propose MKQ-BERT, which further improves the compression level by using 4 bits for quantization. In MKQ-BERT, we propose a novel way of computing the gradient of the quantization scale, combined with an advanced distillation strategy. On the one hand, we show that MKQ-BERT outperforms existing BERT quantization methods by achieving higher accuracy at the same compression level. On the other hand, ours is the first work to successfully deploy a 4-bit BERT and achieve an end-to-end speedup for inference. Our results suggest that we can achieve a 5.3x reduction in bits without degrading model accuracy, and that the inference of an int4 layer is 15x faster than that of a float32 layer in a Transformer-based model.
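To make the abstract's core idea concrete, below is a minimal PyTorch-style sketch of symmetric 4-bit fake quantization with a learnable scale, where the non-differentiable rounding is handled with a straight-through estimator so that the scale receives a gradient during training. The module name FakeQuant4bit and the initialization value are hypothetical; this illustrates the general family of learned-scale quantization, not the paper's specific gradient formulation or its distillation strategy.

```python
import torch
import torch.nn as nn


class FakeQuant4bit(nn.Module):
    """Symmetric 4-bit fake quantization with a learnable scale (illustrative sketch)."""

    def __init__(self, init_scale: float = 0.1):
        super().__init__()
        # The quantization scale is a trainable parameter; how its gradient is
        # computed is exactly the part MKQ-BERT refines (see the paper for details).
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.qmin, self.qmax = -8, 7  # signed int4 range

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Map to the int4 grid and clamp to the representable range.
        v = torch.clamp(x / self.scale, self.qmin, self.qmax)
        # Straight-through estimator: the forward pass uses round(v), while the
        # backward pass treats rounding as the identity function.
        v_q = (v.round() - v).detach() + v
        # Dequantize back to float so the rest of the network trains as usual.
        return v_q * self.scale


if __name__ == "__main__":
    quant = FakeQuant4bit(init_scale=0.05)
    x = torch.randn(2, 8, requires_grad=True)
    y = quant(x)
    y.sum().backward()
    print(y)                 # fake-quantized (dequantized) values
    print(quant.scale.grad)  # gradient flows to the learnable scale
```

In this sketch the scale gradient arises from both the division x / scale and the final dequantization multiply, which is the standard learned-scale setup; MKQ-BERT's contribution, per the abstract, is a different way of computing this gradient together with distillation to recover accuracy at 4 bits.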