TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

Paper Title

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

Authors

Cheng, Zhanzhan; Zhang, Peng; Li, Can; Liang, Qiao; Xu, Yunlu; Li, Pengfei; Pu, Shiliang; Niu, Yi; Wu, Fei

Abstract

Automatically extracting information from visually rich documents (e.g., tickets and resumes) has recently become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two subparts: a text reading part that obtains plain text from the original document images, and an information extraction part that extracts key contents. These methods mainly focus on improving the second part while neglecting that the two parts are highly correlated. This paper proposes a unified end-to-end information extraction framework for visually rich documents, in which text reading and information extraction reinforce each other via a well-designed multi-modal context block. Specifically, the text reading part provides multi-modal features such as visual, textual and layout features. The multi-modal context block fuses the generated multi-modal features, and even prior knowledge from a pre-trained language model, for better semantic representation. The information extraction part is responsible for generating key contents from the fused context features. The framework can be trained end to end, achieving global optimization. Moreover, we define and group visually rich documents into four categories across two dimensions: layout and text type. For each document category, we provide or recommend the corresponding benchmarks, experimental settings and strong baselines, to remedy the lack of a uniform evaluation standard in this research area. Extensive experiments on four kinds of benchmarks (from fixed layout to variable layout, from full-structured text to semi-unstructured text) are reported, demonstrating the proposed method's effectiveness. Data, source code and models are available.
