Paper Title

CNN-transformer mixed model for object detection

Paper Authors

Li, Wenshuo

Paper Abstract

Object detection, one of the three main tasks of computer vision, has been used in various applications. The main process is to use deep neural networks to extract the features of an image and then use those features to identify the class and location of an object. Therefore, the main direction for improving the accuracy of object detection is to improve the neural network so that it extracts features better. In this paper, I propose a convolutional module with a transformer [1], which aims to improve the recognition accuracy of the model by fusing the detailed features extracted by a CNN [2] with the global features extracted by a transformer, and to significantly reduce the computational effort of the transformer module by shrinking the feature map. The main execution steps are convolutional downsampling to reduce the feature map size, then self-attention computation and upsampling, and finally concatenation with the initial input. In the experiments, after appending the block to the end of YOLOv5n [3] and training for 300 epochs on the COCO dataset, the mAP improved by 1.7% compared with the original YOLOv5n, and the mAP curve showed no sign of saturation, so there is still room for improvement. After 100 epochs of training on the Pascal VOC dataset, the accuracy reached 81%, which is 4.6 points higher than Faster R-CNN [4] with ResNet101 [5] as the backbone, while the number of parameters is less than one-twentieth of that model's.
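
The block described in the abstract follows a simple pattern: downsample with a strided convolution so that self-attention runs on far fewer tokens, apply self-attention, upsample back to the original resolution, and concatenate with the original input. The following is a minimal PyTorch sketch of that pattern only; the module name ConvTransformerBlock, the stride, the head count, and the normalization choice are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTransformerBlock(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, channels, num_heads=4, stride=4):
        super().__init__()
        # Strided convolution shrinks the feature map, cutting the cost of attention.
        self.down = nn.Conv2d(channels, channels, kernel_size=stride, stride=stride)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(embed_dim=channels,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.down(x)                              # (b, c, h/s, w/s)
        hs, ws = y.shape[2], y.shape[3]
        tokens = self.norm(y.flatten(2).transpose(1, 2))   # (b, hs*ws, c)
        attn_out, _ = self.attn(tokens, tokens, tokens)     # global self-attention
        y = attn_out.transpose(1, 2).reshape(b, c, hs, ws)
        # Upsample the globally attended features back to the input size.
        y = F.interpolate(y, size=(h, w), mode='nearest')
        # Concatenate global (transformer) and local (CNN) features.
        return torch.cat([x, y], dim=1)               # (b, 2c, h, w)

# Usage example (shapes only):
# blk = ConvTransformerBlock(channels=256)
# out = blk(torch.randn(1, 256, 40, 40))              # -> (1, 512, 40, 40)

Note that the concatenation doubles the channel count, so whatever layer follows the block (e.g. the detection head when it is appended to YOLOv5n) would need to accept 2c input channels; how the paper handles this is not specified in the abstract.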
