Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting

Paper title

Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting

Authors

Bernstein, Vladimir; Afanassenkov, Andrei

Abstract

Processing large amounts of data is an essential problem of the big data era. Most data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the data is transferred using arbitrarily formatted computer-generated documents (such as invoices, purchase orders, financial reports, etc.), which require sophisticated processing and human intervention for data interpretation and extraction. The currently available solutions, ranging from manual data entry to low-level scripting and data extraction tools, are costly and require human intervention. This paper describes the principal methodology for unsupervised, fully automatic data extraction from a wide range of computer-generated documents, assuming that their formatting reflects the original structure of the data sources. The presented methodology falls into the category of unsupervised machine learning and consists of three main parts: (1) detecting repeating patterns of text formatting by employing relative feature space clustering and adaptive weighted feature score maps, (2) detecting hierarchical formatting structures via a collapsing and noise-filtering procedure applied to the repeating formatting patterns, and (3) automatic configuration of an interactive data extraction tool (SiMX TextConverter) for fully automated processing.
