Paper Title


XDoc: Unified Pre-training for Cross-Format Document Understanding

Paper Authors

Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei

Paper Abstract


The surge of pre-training has witnessed the rapid development of document understanding recently. The pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at a time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model which deals with different document formats in a single model. For parameter efficiency, we share backbone parameters across different formats, such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results demonstrate that with only 36.7% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost-effective for real-world deployment. The code and pre-trained models will be publicly available at \url{https://aka.ms/xdoc}.
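The abstract describes a shared backbone (word embedding layer and Transformer layers) reused across document formats, plus lightweight per-format adaptive layers. Below is a minimal PyTorch sketch of that parameter-sharing idea, not the authors' actual implementation: every class, attribute, and format name here is an illustrative assumption.

```python
# Minimal sketch (assumptions, not the official XDoc code) of sharing one
# backbone across document formats while keeping a lightweight adaptive
# layer per format.
import torch
import torch.nn as nn


class XDocSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12,
                 formats=("plain", "document", "web")):
        super().__init__()
        # Shared backbone: one embedding table and one Transformer stack
        # serve every document format (the source of the parameter savings).
        self.embed = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Lightweight adaptive layers: a small per-format projection that
        # specializes the shared representation for each input format.
        self.adaptive = nn.ModuleDict(
            {fmt: nn.Linear(hidden, hidden) for fmt in formats})

    def forward(self, input_ids, doc_format="plain"):
        x = self.embed(input_ids)          # shared word embeddings
        x = self.adaptive[doc_format](x)   # format-specific adaptation
        return self.backbone(x)            # shared Transformer layers


# Usage: the same backbone handles inputs from different formats.
model = XDocSketch()
tokens = torch.randint(0, 30522, (2, 16))
hidden_plain = model(tokens, doc_format="plain")
hidden_web = model(tokens, doc_format="web")
```

Because only the small per-format projections differ, most parameters are shared, which is consistent with the abstract's claim that XDoc needs only a fraction of the parameters of separate format-specific models.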
