Paper Title
HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers
Paper Authors
Paper Abstract
While vision transformers (ViTs) have continuously achieved new milestones in the field of computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose a hardware-efficient image-adaptive token pruning framework called HeatViT for efficient yet accurate ViT acceleration on embedded FPGAs. By analyzing the inherent computational patterns in ViTs, we first design an effective attention-based multi-head token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate the non-informative tokens from input images. Moreover, we implement the token selector on hardware by adding miniature control logic that heavily reuses the existing hardware components built for the backbone ViT. To improve hardware efficiency, we further employ 8-bit fixed-point quantization and propose polynomial approximations, with a regularizing effect on quantization error, for the nonlinear functions frequently used in ViTs. Finally, we propose a latency-aware multi-stage training strategy to determine which transformer blocks receive token selectors and to optimize the desired (average) pruning rates of the inserted token selectors, improving both model accuracy and inference latency on hardware. Compared to existing ViT pruning studies, under a similar computation cost, HeatViT achieves 0.7%$\sim$8.9% higher accuracy; while under similar model accuracy, HeatViT achieves more than 28.4%$\sim$65.3% computation reduction for various widely used ViTs, including DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the Xilinx ZCU102 FPGA achieve a 3.46$\times$$\sim$4.89$\times$ speedup.
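The token selector described above identifies informative tokens and consolidates the rest rather than discarding them. A minimal NumPy sketch of that idea follows; it is illustrative only, not the paper's implementation: the `keep_ratio` parameter, the use of per-token attention scores as importance, and the score-weighted merge rule are assumptions for exposition.

```python
import numpy as np

def token_selector(tokens, attn_scores, keep_ratio=0.7):
    """Keep the highest-scoring tokens; merge the rest into one token.

    tokens:      (N, D) token embeddings (class token excluded)
    attn_scores: (N,) per-token importance, e.g. attention to the class token
    keep_ratio:  fraction of tokens to keep (illustrative hyperparameter)
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(attn_scores)[::-1]     # descending by importance
    keep_idx = np.sort(order[:k])             # preserve original token order
    drop_idx = order[k:]
    kept = tokens[keep_idx]
    if drop_idx.size == 0:
        return kept
    # Consolidate non-informative tokens into a single weighted-average
    # token so their residual information is not discarded outright.
    w = attn_scores[drop_idx]
    w = w / (w.sum() + 1e-6)
    merged = (w[:, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)
    return np.concatenate([kept, merged], axis=0)
```

For example, with 8 input tokens and `keep_ratio=0.7`, 5 tokens are kept and the other 3 collapse into one merged token, so the block after the selector processes 6 tokens instead of 8.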
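The abstract also mentions replacing ViT nonlinearities with polynomial approximations to suit 8-bit fixed-point hardware. A small sketch of the general technique, assuming the tanh-based GELU as the reference function and a least-squares fit on a bounded input range; the polynomial degree, fitting range, and fitting method here are assumptions and do not reproduce HeatViT's actual approximation or its quantization-error regularization.

```python
import numpy as np

def gelu(x):
    # tanh-based GELU approximation commonly used in transformer models
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

# Fit a low-degree polynomial over a bounded activation range; a fixed
# polynomial of multiplies/adds is far cheaper on FPGA fabric than the
# transcendental tanh/erf evaluation it replaces.
xs = np.linspace(-4.0, 4.0, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=4)   # highest-degree coefficient first
poly = np.poly1d(coeffs)
max_err = float(np.max(np.abs(poly(xs) - gelu(xs))))
```

Because the approximation only needs to hold over the (quantized) activation range, restricting the fit interval keeps the degree, and hence the hardware cost, low.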