Title

BiT: Robustly Binarized Multi-distilled Transformer

Authors

Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, Yashar Mehdad

Abstract

Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues; however, it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enables binary transformers at a much higher accuracy than what was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches allow, for the first time, fully binarized transformer models that are at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark to within as little as 5.9%. Code and models are available at: https://github.com/facebookresearch/bit.
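
The elastic binary activation described in the abstract pairs a hard binarization with a learnable scale and threshold, trained end to end. The PyTorch sketch below illustrates one way such an activation could be written; the parameter names (alpha, beta), the scalar parameterization, and the clipped-linear surrogate used for gradients are illustrative assumptions, not the authors' exact implementation (see the linked repository for the real code).

```python
import torch
import torch.nn as nn


class ElasticBinaryActivation(nn.Module):
    """Sketch of an elastic binary activation with a learnable scale (alpha)
    and threshold (beta). Names and details are assumptions for illustration."""

    def __init__(self):
        super().__init__()
        # Learnable scale and threshold, assumed to be scalars here.
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.abs(self.alpha) + 1e-6          # keep the scale positive
        # Hard binarization to {0, 1} around the learned threshold,
        # then rescale/shift back into the activation's range.
        hard = ((x - self.beta) > 0).float()
        out_hard = alpha * hard + self.beta

        # Straight-through estimator: the forward pass uses the hard values,
        # while gradients flow through a clipped linear surrogate.
        soft = torch.clamp((x - self.beta) / alpha, 0.0, 1.0)
        out_soft = alpha * soft + self.beta
        return out_soft + (out_hard - out_soft).detach()


if __name__ == "__main__":
    act = ElasticBinaryActivation()
    x = torch.randn(2, 4, requires_grad=True)
    y = act(x)
    y.sum().backward()                                # gradients reach x, alpha, beta
    print(y, x.grad, act.alpha.grad, act.beta.grad)
```

The detach trick is what lets the forward pass emit hard binary values while gradients follow the clipped surrogate, so the scale and threshold can be learned jointly with the rest of the network.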
