Title
TerViT: An Efficient Ternary Vision Transformer
Authors
Abstract
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computation and memory costs when deployed on resource-constrained devices. In this paper, we introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, which is challenged by the large loss surface gap between real-valued and ternary parameters. To address this issue, we introduce a progressive training scheme that first trains 8-bit transformers and then TerViT, achieving better optimization than conventional methods. Furthermore, we introduce channel-wise ternarization, partitioning each weight matrix into channels, each of which has a unique distribution and ternarization interval. We apply our methods to the popular DeiT and Swin backbones, and extensive results show that we achieve competitive performance. For example, TerViT can quantize Swin-S to a 13.1MB model size while achieving above 79% Top-1 accuracy on the ImageNet dataset.
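The channel-wise ternarization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract only states that each channel gets its own distribution and ternarization interval, so the TWN-style threshold rule (`delta = 0.7 * mean(|w|)`) and the per-channel scale `alpha` used below are assumptions borrowed from standard ternary-weight quantization, and `ternarize_channelwise` is a hypothetical helper name.

```python
import numpy as np

def ternarize_channelwise(W, delta_factor=0.7):
    """Sketch of channel-wise ternarization: each row (channel) of the
    weight matrix W gets its own threshold and scaling factor.

    The threshold rule delta = delta_factor * mean(|w|) is an assumption
    (TWN-style), not taken from the TerViT paper.
    """
    Q = np.zeros_like(W, dtype=float)
    alphas = np.zeros(W.shape[0])
    for c in range(W.shape[0]):
        w = W[c]
        # Per-channel ternarization interval [-delta, delta] maps to 0.
        delta = delta_factor * np.mean(np.abs(w))
        mask = np.abs(w) > delta
        # Per-channel scale: mean magnitude of the surviving weights.
        alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
        # Quantized channel takes values in {-alpha, 0, +alpha}.
        Q[c] = alpha * np.sign(w) * mask
        alphas[c] = alpha
    return Q, alphas
```

Because thresholds and scales are computed per channel rather than per matrix, channels with very different weight magnitudes are ternarized on their own scale instead of being dominated by a single global interval.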