Paper Title
CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Paper Authors
Paper Abstract
Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification problem in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences, each annotated by around 7 annotators.
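To illustrate how 5-point Likert annotations can yield a continuous complexity target, here is a minimal Python sketch. It assumes ratings are normalized to [0, 1] (1 → 0.0, 2 → 0.25, ..., 5 → 1.0) and averaged across annotators; the abstract does not specify the exact aggregation, so treat the mapping and the averaging as illustrative assumptions rather than the dataset's definitive procedure.

```python
# Hypothetical aggregation of 5-point Likert ratings into a continuous
# lexical complexity score in [0, 1]. The mapping and averaging below are
# assumptions for illustration; they are not stated in the abstract.
from statistics import mean

# Map Likert points (1 = very easy ... 5 = very difficult) to [0, 1].
LIKERT_TO_SCORE = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

def complexity_score(ratings: list[int]) -> float:
    """Average the normalized Likert ratings from all annotators."""
    return mean(LIKERT_TO_SCORE[r] for r in ratings)

# Example: one target word rated by ~7 annotators.
ratings = [2, 3, 2, 4, 2, 3, 2]
print(round(complexity_score(ratings), 3))  # 0.393
```

A continuous score like this lets systems be trained and evaluated with regression metrics (e.g., mean absolute error or correlation) instead of binary classification accuracy.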