Unifying Vision, Text, and Layout for Universal Document Processing

Paper Title

Unifying Vision, Text, and Layout for Universal Document Processing

Authors

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

Abstract

We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.
