Paper Title

BusiNet -- a Light and Fast Text Detection Network for Business Documents

Paper Authors

Oshri Naparstek, Ophir Azulai, Daniel Rotman, Yevgeny Burshtein, Peter Staar, Udi Barzelay

Abstract

For digitizing or indexing physical documents, Optical Character Recognition (OCR), the process of extracting textual information from scanned documents, is a vital technology. When a document is visually damaged or contains non-textual elements, existing technologies can yield poor results, as erroneous detection results can greatly affect the quality of OCR. In this paper, we present a detection network dubbed BusiNet aimed at OCR of business documents. Business documents often include sensitive information and, as such, cannot be uploaded to a cloud service for OCR. BusiNet was designed to be fast and light so that it can run locally, avoiding privacy issues. Furthermore, BusiNet is built to handle scanned document corruption and noise using a specialized synthetic dataset. The model is made robust to unseen noise by employing adversarial training strategies. We perform an evaluation on publicly available datasets, demonstrating the usefulness and broad applicability of our model.
