Multilingual Coreference Resolution in Multiparty Dialogue

Paper title

Multilingual Coreference Resolution in Multiparty Dialogue

Authors

Boyuan Zheng, Patrick Xia, Mahsa Yarmohammadi, Benjamin Van Durme

Abstract

Existing multiparty dialogue datasets for entity coreference resolution are nascent, and many challenges are still unaddressed. We create a large-scale dataset, Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due to the availability of gold-quality subtitles in multiple languages, we propose reusing the annotations to create silver coreference resolution data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both using it for data augmentation and training from scratch, which effectively simulates the zero-shot cross-lingual setting.
