Leveraging a New Spanish Corpus for Multilingual and Crosslingual Metaphor Detection

Paper title

Leveraging a New Spanish Corpus for Multilingual and Crosslingual Metaphor Detection

Authors

Elisa Sanchez-Bayona, Rodrigo Agerri

Abstract

The lack of wide-coverage datasets annotated with everyday metaphorical expressions for languages other than English is striking. As a result, most research on supervised metaphor detection has been published only for English. To address this issue, this work presents the first corpus annotated with naturally occurring metaphors in Spanish that is large enough to develop metaphor detection systems. The dataset, CoMeta, includes texts from various domains, namely news, political discourse, Wikipedia and reviews. To label CoMeta, we apply the MIPVU method, the guidelines most commonly used to systematically annotate metaphor in real data. We use the newly created dataset to provide competitive baselines by fine-tuning several multilingual and monolingual state-of-the-art large language models. Furthermore, by leveraging the existing VUAM English data in addition to CoMeta, we present what are, to the best of our knowledge, the first cross-lingual experiments on supervised metaphor detection. Finally, we perform a detailed error analysis that explores the seemingly high transfer of everyday metaphor across these two languages and datasets.
