Paper title
Sequence-to-Sequence Resources for Catalan
Paper authors
Paper abstract
In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, for two tasks: summarization and machine translation (MT). We present two new abstractive summarization datasets in the newswire domain. We also introduce a parallel Catalan-English corpus, paired with three brand-new test sets. Finally, we evaluate the data presented with competing state-of-the-art models, and we develop baselines for these tasks using a newly created Catalan BART. We release the resulting resources of this work under an open license to encourage the development of language technology in Catalan.