Identifying Table Structure in Documents using Conditional Generative Adversarial Networks

Paper title

Identifying Table Structure in Documents using Conditional Generative Adversarial Networks

Authors

Nataliya Le Vine, Claus Horn, Matthew Zeigenfuse, Mark Rowan

Abstract

In many industries, as well as in academic research, information is primarily transmitted in the form of unstructured documents (this article, for example). Hierarchically-related data is rendered as tables, and extracting information from tables in such documents presents a significant challenge. Many existing methods take a bottom-up approach, first integrating lines into cells, then cells into rows or columns, and finally inferring a structure from the resulting 2-D layout. But such approaches neglect the available prior information relating to table structure, namely that the table is merely an arbitrary representation of a latent logical structure. We propose a top-down approach, first using a conditional generative adversarial network to map a table image into a standardised 'skeleton' table form denoting approximate row and column borders without table content, then deriving latent table structure using xy-cut projection and Genetic Algorithm optimisation. The approach is easily adaptable to different table configurations and requires small data set sizes for training.
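
The projection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical skeleton encoding in which cell areas are 1 and the gaps between rows and columns are 0, it performs a single, non-recursive cut, and it leaves out both the conditional GAN and the Genetic Algorithm optimisation. The function names `projection_gaps` and `xy_cut_boundaries` are invented for illustration.

```python
from typing import List, Sequence, Tuple


def projection_gaps(profile: Sequence[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, where profile <= threshold."""
    gaps: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value <= threshold:
            if start is None:
                start = i  # a new gap begins here
        elif start is not None:
            gaps.append((start, i))  # the gap ended just before index i
            start = None
    if start is not None:
        gaps.append((start, len(profile)))  # gap runs to the end of the profile
    return gaps


def xy_cut_boundaries(image: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Project a binary image onto both axes and return the centre of each empty gap.

    Assumption: 1 marks a cell area and 0 marks background in the skeleton image.
    """
    row_profile = [sum(row) for row in image]        # projection onto the y axis
    col_profile = [sum(col) for col in zip(*image)]  # projection onto the x axis
    row_cuts = [(s + e - 1) // 2 for s, e in projection_gaps(row_profile)]
    col_cuts = [(s + e - 1) // 2 for s, e in projection_gaps(col_profile)]
    return row_cuts, col_cuts


# A toy 4x5 "skeleton": two rows and two columns of cells separated by empty gaps.
skeleton = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
row_cuts, col_cuts = xy_cut_boundaries(skeleton)
```

On this toy input the only empty row is index 2 and the only empty column is index 2, so each axis yields a single candidate boundary. A full xy-cut would recurse into the resulting blocks, and on noisy skeletons the `threshold` parameter would need to be raised above zero.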
