Paper Title

DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop

Authors

Shukla, Neelesh K; Raja, Msp; Katikeri, Raghu; Vaid, Amit

Abstract

Business documents come in a variety of structures, formats, and information needs, which makes information extraction a challenging task. Given these variations, a document-generic model that works well across all document types and all use cases seems far-fetched. Document-specific models, in turn, require customized document-specific labels. We introduce DoSA (Document Specific Automated Annotations), which helps annotators generate initial annotations automatically through our novel bootstrap approach, leveraging document-generic datasets and models. These initial annotations can then be reviewed by a human for correctness. An initial document-specific model can be trained on them, and its inference can be used as feedback to generate more automated annotations. These automated annotations can again be reviewed by a human in the loop for correctness, and a new, improved model can be trained using the current model as the pre-trained model before moving to the next iteration. In this paper, our scope is limited to form-like documents because few generic annotated datasets are available, but the idea can be extended to a variety of other documents as more datasets are built. An open-source, ready-to-use implementation is available on GitHub: https://github.com/neeleshkshukla/DoSA.
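The iterative loop described in the abstract can be illustrated with a small toy sketch. This is not the DoSA implementation: every name here (`generic_model`, `train`, `run_annotation_loop`, `demo_review`) is hypothetical, "training" is reduced to memorising reviewed labels, and the human reviewer is simulated by a function. It only shows the control flow: bootstrap from a generic model, review, retrain on top of the current model, repeat.

```python
from typing import Callable, Dict, List, Tuple

Annotation = Dict[str, str]                  # token -> label
Model = Callable[[List[str]], Annotation]
Reviewer = Callable[[List[str], Annotation], Annotation]


def generic_model(tokens: List[str]) -> Annotation:
    """Hypothetical document-generic model: tokens ending in ':' are keys."""
    return {t: ("KEY" if t.endswith(":") else "VALUE") for t in tokens}


def train(reviewed: List[Annotation], base: Model) -> Model:
    """Toy stand-in for fine-tuning: memorise reviewed labels and
    fall back to the current (base) model for unseen tokens."""
    memory: Dict[str, str] = {}
    for annotation in reviewed:
        memory.update(annotation)

    def model(tokens: List[str]) -> Annotation:
        fallback = base(tokens)
        return {t: memory.get(t, fallback[t]) for t in tokens}

    return model


def run_annotation_loop(
    batches: List[List[List[str]]], review: Reviewer, bootstrap: Model
) -> Tuple[Model, List[int]]:
    """Run one annotate -> review -> retrain round per batch.
    Returns the final model and the number of human corrections per round."""
    model = bootstrap
    reviewed: List[Annotation] = []
    corrections_per_round: List[int] = []
    for batch in batches:
        corrections = 0
        for tokens in batch:
            proposed = model(tokens)          # automated annotation
            fixed = review(tokens, proposed)  # human-in-the-loop review
            corrections += sum(1 for t in proposed if proposed[t] != fixed[t])
            reviewed.append(fixed)
        corrections_per_round.append(corrections)
        model = train(reviewed, base=model)   # current model as starting point
    return model, corrections_per_round


def demo_review(tokens: List[str], proposed: Annotation) -> Annotation:
    """Simulated reviewer who knows that 'Total' is a key in this form."""
    fixed = dict(proposed)
    if "Total" in fixed:
        fixed["Total"] = "KEY"
    return fixed


batches = [[["Name:", "Alice", "Total", "42"]], [["Total", "17"]]]
final_model, corrections = run_annotation_loop(batches, demo_review, generic_model)
print(corrections)
```

In this toy run the reviewer corrects one label in the first round and none in the second, which mirrors the intended effect: each reviewed round should reduce the manual effort needed in the next.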
