Paper Title
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
Paper Authors
Paper Abstract
With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging given their high computation and storage demands, especially when real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate, but not hardware friendly; structured pruning is coarse-grained and hardware-efficient, but incurs higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework for efficiently executing DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code-generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
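To make the "fine-grained pruning patterns inside coarse-grained structures" idea concrete, below is a minimal NumPy sketch of pattern assignment for 3x3 convolution kernels: each kernel keeps only the positions allowed by one pattern from a small fixed library. Note the pattern masks and the one-shot, magnitude-based selection here are illustrative assumptions; the paper derives pattern assignments through its extended ADMM training framework rather than a post-hoc heuristic.

```python
import numpy as np

# Illustrative pattern library (an assumption, not the paper's exact set):
# each pattern keeps 4 of the 9 positions in a 3x3 kernel, always including
# the center, mirroring the kind of patterns PatDNN uses.
PATTERNS = np.array([
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
], dtype=np.float32)

def assign_patterns(weights):
    """Pick, for each 3x3 kernel, the pattern preserving the most weight
    magnitude, then zero out the remaining positions.

    weights: array of shape (out_channels, in_channels, 3, 3)
    returns: (pruned weights, chosen pattern index per kernel)
    """
    out_c, in_c, _, _ = weights.shape
    pruned = np.empty_like(weights)
    idx = np.empty((out_c, in_c), dtype=np.int64)
    for o in range(out_c):
        for i in range(in_c):
            k = weights[o, i]
            # Score each pattern by the L1 norm of the weights it keeps.
            scores = [np.abs(k * p).sum() for p in PATTERNS]
            best = int(np.argmax(scores))
            idx[o, i] = best
            pruned[o, i] = k * PATTERNS[best]
    return pruned, idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4, 3, 3)).astype(np.float32)
    pruned, idx = assign_patterns(w)
    print("kept fraction:", (pruned != 0).mean())  # ~4/9 of the weights remain
```

The compiler-level payoff follows from this structure: once every kernel carries one of a handful of known patterns, filters whose kernels share the same pattern can be grouped together (filter kernel reordering), the surviving weights can be stored compactly with per-kernel pattern indices (compressed weight storage), and uniform inner loops can be generated without per-element sparsity checks, which is how the high hardware efficiency lost by non-structured pruning is regained.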