Paper Title
Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages
Paper Authors
Paper Abstract
Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.