Paper Title
Tackling the Low-resource Challenge for Canonical Segmentation
Paper Authors
Paper Abstract
Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.
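The sketch below illustrates the task framing described in the abstract: canonical segmentation treated as character-level string transduction, the usual setup for sequence-to-sequence models such as the pointer-generator and hard-monotonic-attention models mentioned above. The example word/segmentation pairs, the "|" separator symbol, and the exact-match accuracy metric are illustrative assumptions, not taken from the paper's datasets or evaluation code.

```python
# Illustrative sketch only (not the paper's implementation).
# Canonical segmentation maps a surface word to its standardized morphemes,
# restoring spelling changes: e.g., "achievability" -> "achieve|able|ity",
# whereas a purely surface segmentation would give "achiev|abil|ity".
# The pairs and the "|" separator below are assumed for illustration.
TRAIN_PAIRS = [
    ("achievability", "achieve|able|ity"),
    ("funniest", "funny|est"),
    ("impossible", "in|possible"),
]

def to_char_sequences(word: str, segmentation: str):
    """Turn one (word, canonical segmentation) pair into the character-level
    source/target sequences a sequence-to-sequence model would be trained on."""
    source = list(word)          # e.g., ['a', 'c', 'h', 'i', ...]
    target = list(segmentation)  # '|' appears as an explicit separator symbol
    return source, target

def exact_match_accuracy(predictions, gold):
    """One common metric for this task (assumed here): a prediction counts as
    correct only if the full canonical segmentation matches the reference."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

if __name__ == "__main__":
    for word, seg in TRAIN_PAIRS:
        src, tgt = to_char_sequences(word, seg)
        print(f"{word!r:18} -> source={''.join(src)!r}, target={''.join(tgt)!r}")

    # Toy evaluation with hypothetical model outputs; one of three is wrong.
    gold = [seg for _, seg in TRAIN_PAIRS]
    predictions = ["achieve|able|ity", "funni|est", "in|possible"]
    print(f"exact-match accuracy: {exact_match_accuracy(predictions, gold):.1%}")
```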