Paper Title
CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Paper Authors
Paper Abstract
Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification problem in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences, each annotated by around 7 annotators.
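To illustrate how 5-point Likert annotations can yield a continuous complexity target, here is a minimal Python sketch. It assumes ratings are normalized to [0, 1] (1 → 0.0, 2 → 0.25, ..., 5 → 1.0) and averaged across annotators; the abstract does not specify the exact aggregation, so treat the mapping and the averaging as illustrative assumptions rather than the dataset's definitive procedure.

```python
# Hypothetical aggregation of 5-point Likert ratings into a continuous
# lexical complexity score in [0, 1]. The mapping and averaging below are
# assumptions for illustration; they are not stated in the abstract.
from statistics import mean

# Map Likert points (1 = very easy ... 5 = very difficult) to [0, 1].
LIKERT_TO_SCORE = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

def complexity_score(ratings: list[int]) -> float:
    """Average the normalized Likert ratings from all annotators."""
    return mean(LIKERT_TO_SCORE[r] for r in ratings)

# Example: one target word rated by ~7 annotators.
ratings = [2, 3, 2, 4, 2, 3, 2]
print(round(complexity_score(ratings), 3))  # 0.393
```

A continuous score like this lets systems be trained and evaluated with regression metrics (e.g., mean absolute error or correlation) instead of binary classification accuracy.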