Provenance for Linguistic Corpora Through Nanopublications

Paper title

Provenance for Linguistic Corpora Through Nanopublications

Authors

Lek, Timo; de Groot, Anna; Kuhn, Tobias; Morante, Roser

Abstract

Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies. While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work. Moreover, these annotations might have evolved into different versions, making it challenging for researchers to know the data's provenance. This paper addresses this issue with a case study on event annotated corpora and by creating a new, more interoperable representation of this data in the form of nanopublications. We demonstrate how linguistic annotations from separate corpora can be reliably linked from the start, and thereby be accessed and queried as if they were a single dataset. We describe how such nanopublications can be created and demonstrate how SPARQL queries can be performed to extract interesting content from the new representations. The queries show that information of multiple corpora can be retrieved more easily and effectively because the information of different corpora is represented in a uniform data format.

Paper title