MuCPAD: A Multi-Domain Chinese Predicate-Argument Dataset

Paper title

MuCPAD: A Multi-Domain Chinese Predicate-Argument Dataset

Authors

Yahui Liu, Haoping Yang, Chen Gong, Qingrong Xia, Zhenghua Li, Min Zhang

Abstract

During the past decade, neural network models have made tremendous progress on in-domain semantic role labeling (SRL). However, performance drops dramatically in the out-of-domain setting. To facilitate research on cross-domain SRL, this paper presents MuCPAD, a multi-domain Chinese predicate-argument dataset consisting of 30,897 sentences and 92,051 predicates from six different domains. MuCPAD exhibits three important features. 1) Based on a frame-free annotation methodology, we avoid writing complex frames for new predicates. 2) We explicitly annotate omitted core arguments to recover more complete semantic structure, considering that omission of content words is ubiquitous in multi-domain Chinese texts. 3) We compile 53 pages of annotation guidelines and adopt strict double annotation to improve data quality. This paper describes in detail the annotation methodology and annotation process of MuCPAD, and presents in-depth data analysis. We also give benchmark results on cross-domain SRL based on MuCPAD.

