Paper title
ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs against Textual Sources
Paper authors
Paper abstract
Knowledge Graphs are repositories of information that gather data from a multitude of domains and sources in the form of semantic triples, serving as a source of structured data for various crucial applications in the modern web landscape, from Wikipedia infoboxes to search engines. Such graphs mainly serve as secondary sources of information and depend on well-documented and verifiable provenance to ensure their trustworthiness and usability. However, their ability to systematically assess and assure the quality of this provenance, most crucially whether it properly supports the graph's information, relies mainly on manual processes that do not scale with size. ProVe aims at remedying this, consisting of a pipelined approach that automatically verifies whether a Knowledge Graph triple is supported by text extracted from its documented provenance. ProVe is intended to assist information curators and consists of four main steps involving rule-based methods and machine learning models: text extraction, triple verbalisation, sentence selection, and claim verification. ProVe is evaluated on a Wikidata dataset, achieving promising results overall and excellent performance on the binary classification task of detecting support from provenance, with 87.5% accuracy and 82.9% F1-macro on text-rich sources. The evaluation data and scripts used in this paper are available on GitHub and Figshare.
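The four steps named in the abstract (text extraction, triple verbalisation, sentence selection, claim verification) can be illustrated with a minimal sketch. This is not the ProVe implementation: the abstract says the real pipeline combines rule-based methods with machine learning models, whereas every function below is a hypothetical, purely lexical stand-in (regex sentence splitting, template verbalisation, token-overlap ranking, and a token-containment check), chosen only so that the data flow between the steps is visible. All names, including `prove_sketch`, are assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

# A knowledge graph triple as (subject, predicate, object) labels.
Triple = Tuple[str, str, str]


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_sentences(document: str) -> List[str]:
    """Step 1 (toy): split source text into sentences at ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def verbalise(triple: Triple) -> str:
    """Step 2 (toy): turn a triple into a sentence-like claim via a template."""
    subject, predicate, obj = triple
    return f"{subject} {predicate} {obj}."


def select_sentences(claim: str, sentences: List[str], k: int = 2) -> List[str]:
    """Step 3 (toy): keep the k sentences with the highest Jaccard overlap."""
    claim_tokens = tokens(claim)

    def overlap(sentence: str) -> float:
        sentence_tokens = tokens(sentence)
        union = claim_tokens | sentence_tokens
        return len(claim_tokens & sentence_tokens) / len(union) if union else 0.0

    return sorted(sentences, key=overlap, reverse=True)[:k]


def verify(triple: Triple, evidence: List[str]) -> str:
    """Step 4 (toy): 'supported' if one sentence mentions subject and object."""
    subject, _, obj = triple
    needed = tokens(subject) | tokens(obj)
    for sentence in evidence:
        if needed <= tokens(sentence):
            return "supported"
    return "not supported"


def prove_sketch(triple: Triple, document: str, k: int = 2) -> str:
    """Chain the four toy steps for one triple and one source document."""
    sentences = extract_sentences(document)
    claim = verbalise(triple)
    evidence = select_sentences(claim, sentences, k)
    return verify(triple, evidence)


if __name__ == "__main__":
    source = ("Marie Curie was born in Warsaw. "
              "She won the Nobel Prize in Physics in 1903.")
    print(prove_sketch(("Marie Curie", "place of birth", "Warsaw"), source))
    print(prove_sketch(("Marie Curie", "place of birth", "Paris"), source))
```

The binary output mirrors the support-detection task reported in the abstract. A lexical check like this one would fail on paraphrases and negations, which is one reason a learned verification model is the more plausible choice in practice.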