Paper Title

Scene Text Recognition via Transformer

Authors

Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, Shengping Zhang

Abstract

Scene text recognition with arbitrary shapes is very challenging due to large variations in text shape, font, color, background, etc. Most state-of-the-art algorithms rectify the input image into a normalized image and then treat recognition as a sequence prediction task. The bottleneck of such methods is the rectification itself, which introduces errors due to perspective distortion. In this paper, we find that rectification is completely unnecessary; all we need is spatial attention. We therefore propose a simple but extremely effective scene text recognition method based on the transformer [50]. Different from previous transformer-based models [56,34], which only use the transformer's decoder to decode convolutional attention, the proposed method feeds convolutional feature maps into the transformer as word embeddings. In this way, our method makes full use of the transformer's powerful attention mechanism. Extensive experimental results show that the proposed method outperforms state-of-the-art methods by a very large margin on both regular and irregular text datasets. On CUTE, one of the most challenging datasets, where the best previously reported accuracy is 89.6%, our method achieves 99.3%, a surprising result. We will release our source code and believe that our method will serve as a new benchmark for scene text recognition with arbitrary shapes.
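The core idea described in the abstract is to flatten the CNN feature map into a sequence of spatial "tokens" and feed them directly into a standard transformer, rather than decoding convolutional attention or rectifying the image first. Below is a minimal sketch of that idea, not the authors' released code: the backbone, the learned positional embedding, the layer counts, and the `max_cells` cap are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvEmbeddingTransformer(nn.Module):
    # Flattened CNN feature maps act as the transformer's "word embeddings",
    # so 2D spatial attention replaces explicit rectification.
    def __init__(self, num_classes, d_model=512, nhead=8, max_cells=1024):
        super().__init__()
        # Hypothetical lightweight backbone standing in for the paper's CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Assumed learned positional embedding over flattened spatial cells.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_cells, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=4, num_decoder_layers=4,
        )
        self.char_embed = nn.Embedding(num_classes, d_model)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, images, tgt_tokens):
        # images: (B, 3, H, W); tgt_tokens: (B, T) shifted target characters.
        f = self.backbone(images)                    # (B, d_model, h, w)
        b, c, h, w = f.shape
        src = f.flatten(2).permute(2, 0, 1)          # (h*w, B, d_model): one token per cell
        src = src + self.pos_embed[:, : h * w].permute(1, 0, 2)
        tgt = self.char_embed(tgt_tokens).permute(1, 0, 2)   # (T, B, d_model)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(0))
        out = self.transformer(src, tgt, tgt_mask=mask)      # (T, B, d_model)
        return self.classifier(out)                          # (T, B, num_classes)
```

At inference, decoding would proceed autoregressively, feeding previously predicted characters back in as `tgt_tokens`; every encoder token corresponds to a spatial cell of the feature map, which is what lets the attention handle curved or perspective-distorted text without rectification.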
