An Annotated Corpus of Webtables for Information Extraction Tasks

Paper Title

An Annotated Corpus of Webtables for Information Extraction Tasks

Authors

Erin Macdonald, Denilson Barbosa

Abstract

Information Extraction is a well-researched area of Natural Language Processing, with applications in web search and question answering. It is concerned with identifying entities and the relationships between them as expressed in a given context, usually a sentence or a paragraph of running text. Given the importance of the task, several datasets and benchmarks have been curated over the years. However, focusing on running text alone leaves out tables, which are common in many structured documents and in which pairs of entities also co-occur in context (e.g., in the same row of the table). While there are recent papers on relation extraction from tables in the literature, their experimental evaluations have used ad-hoc datasets for lack of a standard benchmark. This paper helps close that gap. We introduce an annotation framework and a dataset of 217,834 tables from Wikipedia, annotated with 28 relations using both classifiers and carefully designed queries over a reference knowledge graph. Binary classifiers are then applied to the resulting dataset to remove false positives, resulting in an average annotation accuracy of 94%. The resulting dataset is the first of its kind to be made publicly available.
