Paper Title
VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout
Paper Authors
Paper Abstract
Multi-class product counting and recognition identifies product items from images or videos for automated retail checkout. The task is challenging due to real-world occlusions where product items overlap, fast movement on the conveyor belt, high similarity in the overall appearance of the items being scanned, novel products, and the negative impact of misidentifying items. Further, there is a domain bias between the training and test sets: the provided training dataset consists of synthetic images, while the test-set videos contain foreign objects such as hands and trays. To address these issues, we propose to segment and classify individual frames from a video sequence. The segmentation method consists of a unified single product-item and hand segmentation followed by entropy masking to address the domain bias problem. The multi-class classification method is based on Vision Transformers (ViT). To identify the frames containing target objects, we utilize several image processing methods and propose a custom metric to discard frames without any product items. Combining all these mechanisms, our best system achieves 3rd place in AI City Challenge 2022 Track 4 with an F1 score of 0.4545. Code will be available at
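The abstract does not spell out the custom frame-filtration metric, but the title points to image colorfulness. The Python sketch below illustrates how such a filter could look, using the Hasler and Süsstrunk (2003) colorfulness measure as a stand-in for the paper's actual metric; the function names, threshold value, and use of this particular measure are assumptions for illustration only, not the authors' implementation.

import numpy as np

def colorfulness(frame_rgb: np.ndarray) -> float:
    # Hasler-Susstrunk colorfulness of an H x W x 3 RGB frame (assumed stand-in metric).
    r = frame_rgb[..., 0].astype(np.float64)
    g = frame_rgb[..., 1].astype(np.float64)
    b = frame_rgb[..., 2].astype(np.float64)
    rg = r - g                   # red-green opponent channel
    yb = 0.5 * (r + g) - b       # yellow-blue opponent channel
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std_root + 0.3 * mean_root)

def keep_frame(frame_rgb: np.ndarray, threshold: float = 15.0) -> bool:
    # Discard frames whose colorfulness falls below an (assumed) threshold,
    # i.e. frames unlikely to contain a product item.
    return colorfulness(frame_rgb) > threshold

In a video pipeline, such a check would run per frame before segmentation and ViT classification, so that empty or background-only frames are dropped early and never reach the classifier.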