Paper Title

MobileTL: On-device Transfer Learning with Inverted Residual Blocks

Paper Authors

Chiang, Hung-Yueh, Frumkin, Natalia, Liang, Feng, Marculescu, Diana

Paper Abstract

Transfer learning at the edge is challenging due to limited on-device resources. Existing work addresses this issue by training a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacked layers, e.g., convolution, normalization, and activation layers. Though they are efficient for inference, IRBs require that additional activation maps be stored in memory to train the weights of convolution layers and the scales of normalization layers. As a result, their high memory cost prohibits training IRBs on resource-limited edge devices and makes them unsuitable in the context of transfer learning. To address this issue, we present MobileTL, a memory- and computation-efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass. Also, MobileTL approximates the backward computation of the activation layers (e.g., Hard-Swish and ReLU6) as a signed function, which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to the output) rather than propagating the gradient through the whole network to reduce the computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks, while only incurring a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work in transfer learning for edge devices.
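Below is a minimal PyTorch sketch (not the authors' released implementation) of two of the ideas described in the abstract, assuming a torchvision mobilenet_v3_small backbone: a ReLU6 whose backward pass keeps only a binary mask derived from the sign of the input, and a helper that freezes everything except the normalization shifts (biases) in the last few blocks plus a new classifier head. The names MaskedReLU6 and prepare_for_transfer are illustrative, not from the paper, and requires_grad flags only select which parameters are updated; realizing the full memory savings reported by MobileTL additionally requires customized backward routines such as the masked activation shown here.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small


class MaskedReLU6Fn(torch.autograd.Function):
    """ReLU6 forward; backward approximated with a binary mask of the input sign,
    so the full-precision activation map need not be kept for the backward pass."""

    @staticmethod
    def forward(ctx, x):
        mask = x > 0                     # boolean mask (bit-packable) instead of the activation map
        ctx.save_for_backward(mask)
        return torch.clamp(x, min=0.0, max=6.0)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # Approximation: pass the gradient wherever the input was positive,
        # ignoring the upper clamp at 6.
        return grad_out * mask.to(grad_out.dtype)


class MaskedReLU6(nn.Module):
    def forward(self, x):
        return MaskedReLU6Fn.apply(x)


def prepare_for_transfer(model, num_blocks_to_tune=5, num_classes=10):
    """Freeze the backbone, then unfreeze only the shifts (biases) of normalization
    layers inside the last `num_blocks_to_tune` feature blocks and a new classifier head."""
    for p in model.parameters():
        p.requires_grad = False

    for block in model.features[-num_blocks_to_tune:]:
        for m in block.modules():
            if isinstance(m, nn.BatchNorm2d) and m.bias is not None:
                m.bias.requires_grad = True   # train shifts only; scales stay frozen

    in_features = model.classifier[-1].in_features
    model.classifier[-1] = nn.Linear(in_features, num_classes)  # task-specific head
    return model


if __name__ == "__main__":
    # Masked activation: gradients flow only where the input was positive.
    x = torch.randn(2, 8, requires_grad=True)
    MaskedReLU6()(x).sum().backward()
    print(x.grad)

    # Shift-only fine-tuning of the top 5 blocks of a pretrained MobileNetV3.
    model = prepare_for_transfer(mobilenet_v3_small(weights="DEFAULT"))
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
```

Storing a boolean mask per activation element is what makes the backward approximation attractive on memory-constrained devices: the gradient of ReLU6 with respect to its input is recovered (approximately) from the mask alone, so the floating-point activation map can be discarded after the forward pass.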
