Paper Title
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
Paper Authors
Paper Abstract
With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging given their high computation and storage demands, especially when real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate, but not hardware friendly; structured pruning is coarse-grained and hardware-efficient, but incurs higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework for efficiently executing DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code-generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
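To make the "fine-grained pruning patterns inside coarse-grained structures" idea concrete, below is a minimal NumPy sketch of pattern assignment for 3x3 convolution kernels: each kernel keeps only the positions allowed by one pattern from a small fixed library. Note the pattern masks and the one-shot, magnitude-based selection here are illustrative assumptions; the paper derives pattern assignments through its extended ADMM training framework rather than a post-hoc heuristic.

```python
import numpy as np

# Illustrative pattern library (an assumption, not the paper's exact set):
# each pattern keeps 4 of the 9 positions in a 3x3 kernel, always including
# the center, mirroring the kind of patterns PatDNN uses.
PATTERNS = np.array([
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
], dtype=np.float32)

def assign_patterns(weights):
    """Pick, for each 3x3 kernel, the pattern preserving the most weight
    magnitude, then zero out the remaining positions.

    weights: array of shape (out_channels, in_channels, 3, 3)
    returns: (pruned weights, chosen pattern index per kernel)
    """
    out_c, in_c, _, _ = weights.shape
    pruned = np.empty_like(weights)
    idx = np.empty((out_c, in_c), dtype=np.int64)
    for o in range(out_c):
        for i in range(in_c):
            k = weights[o, i]
            # Score each pattern by the L1 norm of the weights it keeps.
            scores = [np.abs(k * p).sum() for p in PATTERNS]
            best = int(np.argmax(scores))
            idx[o, i] = best
            pruned[o, i] = k * PATTERNS[best]
    return pruned, idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4, 3, 3)).astype(np.float32)
    pruned, idx = assign_patterns(w)
    print("kept fraction:", (pruned != 0).mean())  # ~4/9 of the weights remain
```

The compiler-level payoff follows from this structure: once every kernel carries one of a handful of known patterns, filters whose kernels share the same pattern can be grouped together (filter kernel reordering), the surviving weights can be stored compactly with per-kernel pattern indices (compressed weight storage), and uniform inner loops can be generated without per-element sparsity checks, which is how the high hardware efficiency lost by non-structured pruning is regained.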