Paper Title

Towards Deployable OCR models for Indic languages

Paper Authors

Minesh Mathew, Ajoy Mondal, CV Jawahar

Paper Abstract

Recognition of text on word or line images, without the need for sub-word segmentation, has become the mainstream of research and development in text recognition for Indian languages. Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR. In this work, we present a comprehensive empirical study of various neural network models that use CTC for transcribing the step-wise predictions in the neural network output into a Unicode sequence. The study is conducted for 13 Indian languages, using an internal dataset of around 1000 pages per language. We study the choice of line vs. word as the recognition unit and the use of synthetic data to train the models. We compare our models with popular publicly available OCR tools for end-to-end document image recognition. Our end-to-end pipeline, which employs our recognition models and existing text segmentation tools, outperforms these public OCR tools for 8 out of the 13 languages. We also introduce a new public dataset called Mozhi for word and line recognition in Indian languages. The dataset contains more than 1.2 million annotated word images (120 thousand text lines) across 13 Indian languages. Our code, trained models, and the Mozhi dataset will be made available at http://cvit.iiit.ac.in/research/projects/cvit-projects/
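To make the CTC-based, segmentation-free setup described in the abstract concrete, below is a minimal PyTorch sketch of a generic CRNN-style recognizer trained with CTC loss. This is not the authors' actual architecture or training code; the layer sizes, vocabulary size, and image dimensions are illustrative assumptions, and the sketch only shows how per-timestep network outputs are paired with nn.CTCLoss to learn an unsegmented Unicode character sequence.

```python
# A minimal sketch (not the paper's exact models) of a CTC-based word/line
# recognizer: a small CNN feature extractor followed by a bidirectional LSTM,
# trained with torch.nn.CTCLoss so per-timestep predictions are transcribed
# into a character sequence without sub-word segmentation. Sizes are assumptions.
import torch
import torch.nn as nn


class CRNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # CNN backbone: collapses image height, keeps width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # reduce height to 1
        )
        self.rnn = nn.LSTM(256, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        # +1 output class for the CTC "blank" symbol at index 0.
        self.fc = nn.Linear(512, num_classes + 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 1, H, W) grayscale word/line crops
        feats = self.cnn(images)                   # (B, C, 1, W')
        feats = feats.squeeze(2).permute(0, 2, 1)  # (B, W', C)
        seq, _ = self.rnn(feats)                   # (B, W', 2 * hidden)
        logits = self.fc(seq)                      # (B, W', num_classes + 1)
        # CTCLoss expects (T, B, C) log-probabilities.
        return logits.log_softmax(-1).permute(1, 0, 2)


# One training step with CTC loss on a dummy batch (hypothetical charset size).
model = CRNN(num_classes=110)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 128)            # dummy word/line images
targets = torch.randint(1, 111, (4, 20))       # dummy label indices (0 = blank)
target_lengths = torch.full((4,), 20, dtype=torch.long)
log_probs = model(images)                      # (T, B, C)
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

At inference time, a greedy or beam-search CTC decoder would collapse repeated predictions and remove blanks to produce the final Unicode string; the paper's study additionally compares line-level vs. word-level recognition units and the use of synthetic training data, which this sketch does not cover.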
