Paper Title

FullPack: Full Vector Utilization for Sub-Byte Quantized Inference on General Purpose CPUs

Authors

Hossein Katebi, Navidreza Asadi, Maziar Goudarzi

Abstract

Although prior art has demonstrated negligible accuracy drop in sub-byte quantization -- where weights and/or activations are represented by less than 8 bits -- popular SIMD instructions of CPUs do not natively support these datatypes. While recent methods, such as ULPPACK, are already using sub-byte quantization on general-purpose CPUs with vector units, they leave out several empty bits between the sub-byte values in memory and in vector registers to avoid overflow to the neighbours during the operations. This results in memory footprint and bandwidth-usage inefficiencies and suboptimal performance. In this paper, we present memory layouts for storing, and mechanisms for processing sub-byte (4-, 2-, or 1-bit) models that utilize all the bits in the memory as well as in the vector registers for the actual data. We provide compute kernels for the proposed layout for the GEMV (GEneral Matrix-Vector multiplication) operations between weights and activations of different datatypes (e.g., 8-bit activations and 4-bit weights). For evaluation, we extended the TFLite package and added our methods to it, then ran the models on the cycle-accurate gem5 simulator to compare detailed memory and CPU cycles of each method. We compare against nine other methods that are actively used in production including GEMLOWP, Ruy, XNNPack, and ULPPACK. Furthermore, we explore the effect of different input and output sizes of deep learning layers on the performance of our proposed method. Experimental results show 0.96-2.1x speedup for small sizes and 1.2-6.7x speedup for mid to large sizes. Applying our proposal to a real-world speech recognition model, Mozilla DeepSpeech, we proved that our method achieves 1.56-2.11x end-to-end speedup compared to the state-of-the-art, depending on the bit-width employed.
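To make the dense-packing idea concrete, the following is a minimal scalar C sketch, not the paper's actual SIMD kernels: it stores signed 4-bit weights two per byte with no padding bits (using every bit of memory for data) and computes a GEMV against 8-bit activations, sign-extending the nibbles on the fly. The names pack_int4 and gemv_i8_i4, and the assumption that the column count is even, are illustrative only.

    #include <stdint.h>
    #include <stddef.h>

    /* Pack signed 4-bit weights densely, two per byte, with no empty bits
     * between values. Assumes n is even. (Illustrative helper, not the
     * paper's layout routine.) */
    static void pack_int4(const int8_t *w, uint8_t *packed, size_t n) {
        for (size_t i = 0; i < n; i += 2) {
            uint8_t lo = (uint8_t)(w[i]     & 0x0F);
            uint8_t hi = (uint8_t)(w[i + 1] & 0x0F);
            packed[i / 2] = (uint8_t)(lo | (hi << 4));
        }
    }

    /* Scalar GEMV y[r] = sum_c a[c] * W[r][c] with 8-bit activations and
     * densely packed 4-bit weights, unpacked (sign-extended) on the fly.
     * Assumes cols is even. */
    static void gemv_i8_i4(const int8_t *a, const uint8_t *w_packed,
                           int32_t *y, size_t rows, size_t cols) {
        for (size_t r = 0; r < rows; ++r) {
            int32_t acc = 0;
            const uint8_t *row = w_packed + r * (cols / 2);
            for (size_t c = 0; c < cols; c += 2) {
                uint8_t byte = row[c / 2];
                /* Sign-extend each 4-bit nibble to an int8_t. */
                int8_t w0 = (int8_t)(byte << 4) >> 4;
                int8_t w1 = (int8_t)(byte & 0xF0) >> 4;
                acc += (int32_t)a[c]     * w0;
                acc += (int32_t)a[c + 1] * w1;
            }
            y[r] = acc;
        }
    }

In this sketch the packed weight row occupies cols/2 bytes instead of the cols bytes a padded sub-byte layout would use, which is the memory-footprint and bandwidth saving the abstract describes; the paper's contribution is performing the equivalent computation directly in full vector registers rather than with per-element unpacking.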
