Paper Title

Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents

Authors

Matthias Engelbach, Dennis Klau, Jens Drawehn, Maximilien Kintz

Abstract

Extracting information from unstructured text documents is a demanding task, since these documents can have a broad variety of layouts and a non-trivial reading order, as is the case for multi-column documents or nested tables. Additionally, many business documents are received in paper form, meaning that the textual contents need to be digitized before further analysis. Nonetheless, automatically detecting and capturing crucial document information, such as the sender address, would boost many companies' processing efficiency. In this work we propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents. We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images and validate these results by analyzing the contained text using domain knowledge represented as a rule-based system.
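
The validation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the detector, the OCR step, or the rule base. The `Candidate` type, the two regular expressions (which assume a street line with a house number and a line with a 5-digit postal code followed by a city), and the `min_score` threshold are all hypothetical and chosen only to show how detector output might be filtered by text rules.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical rules (assumption): the paper's actual rule base is not given
# in the abstract. These patterns assume an address contains one line with a
# street name plus house number and one line with a 5-digit postal code plus city.
STREET_NO = re.compile(r"^[^\d\n]+[ \t]\d{1,4}[a-zA-Z]?$", re.MULTILINE)
POSTAL_CITY = re.compile(r"^\d{5}[ \t]+[^\d\n]+$", re.MULTILINE)


@dataclass
class Candidate:
    """A region proposed by a visual detector (hypothetical structure)."""
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    score: float                    # detector confidence in [0, 1]
    text: str                       # OCR text found inside the box


def looks_like_address(text: str) -> bool:
    """Return True if the text satisfies both illustrative address rules."""
    return bool(STREET_NO.search(text)) and bool(POSTAL_CITY.search(text))


def validate(candidates: List[Candidate], min_score: float = 0.5) -> List[Candidate]:
    """Keep regions that the detector trusts and whose text passes the rules."""
    return [
        c for c in candidates
        if c.score >= min_score and looks_like_address(c.text)
    ]


candidates = [
    Candidate((40, 60, 400, 160), 0.92,
              "Musterfirma GmbH\nHauptstrasse 12\n70569 Stuttgart"),
    Candidate((40, 700, 400, 740), 0.88, "Invoice total: 12345 EUR"),
    Candidate((500, 60, 800, 160), 0.20,
              "Beispiel AG\nNebenweg 3a\n10115 Berlin"),
]
accepted = validate(candidates)
```

In this sketch a region is accepted only when both signals agree, so a confident box without address-like text and an address-like text in a low-confidence box are both rejected. How the two signals are actually combined in the paper is not specified in the abstract.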
