Paper Title

SVTR: Scene Text Recognition with a Single Visual Model

Paper Authors

Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, Yu-Gang Jiang

Paper Abstract

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR.
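The abstract outlines the whole pipeline: a text image is tokenized into "character component" patches, processed by global and local mixing blocks, and read out with a simple linear prediction. Below is a minimal PyTorch-style sketch of that reading. Every detail here, including the class names (`MixingBlock`, `SVTRSketch`), the 4x4 patch size, the (7, 11) local window, the single two-block stage, and height pooling as the combining step, is an illustrative assumption rather than the authors' exact design (the reference implementation lives in the PaddleOCR repository); the between-stage merging is omitted for brevity.

```python
# Illustrative sketch of the SVTR idea from the abstract; hyperparameters
# and structure are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class MixingBlock(nn.Module):
    # Pre-norm transformer block over character components. Global mixing
    # attends across all components (inter-character patterns); local mixing
    # masks attention to a small neighborhood (intra-character patterns).
    def __init__(self, dim, heads, hw, local=False, window=(7, 11)):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        mask = None
        if local:
            h, w = hw
            idx = torch.arange(h * w)
            yi, xi = idx // w, idx % w
            near = ((yi[:, None] - yi[None, :]).abs() <= window[0] // 2) \
                 & ((xi[:, None] - xi[None, :]).abs() <= window[1] // 2)
            mask = ~near  # True = attention disallowed outside the window
        self.register_buffer("mask", mask)

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=self.mask)
        x = x + a
        return x + self.mlp(self.norm2(x))


class SVTRSketch(nn.Module):
    def __init__(self, img_size=(32, 128), dim=64, num_classes=97):
        super().__init__()
        self.h, self.w = img_size[0] // 4, img_size[1] // 4
        # Patch embedding: decompose the text image into 4x4 "character
        # components" and project each to a dim-d token.
        self.embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.pos = nn.Parameter(torch.zeros(1, self.h * self.w, dim))
        self.blocks = nn.Sequential(
            MixingBlock(dim, 2, (self.h, self.w), local=True),   # intra-char
            MixingBlock(dim, 2, (self.h, self.w), local=False),  # inter-char
        )
        # "Combining" approximated by height pooling; characters are then
        # read off with a simple per-column linear prediction.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        x = self.blocks(x)
        x = x.reshape(-1, self.h, self.w, x.size(-1)).mean(dim=1)
        return self.head(x)  # (batch, width, num_classes)


logits = SVTRSketch()(torch.randn(1, 3, 32, 128))
print(logits.shape)  # torch.Size([1, 32, 97])
```

The per-column logits would feed a CTC-style decoder, which is consistent with the abstract's claim that characters are recognized by linear prediction alone, with no recurrent sequence model.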
