Paper Title
Tackling the Low-resource Challenge for Canonical Segmentation
Paper Authors
Paper Abstract
Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.
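The sketch below illustrates the task framing described in the abstract: canonical segmentation treated as character-level string transduction, the usual setup for sequence-to-sequence models such as the pointer-generator and hard-monotonic-attention models mentioned above. The example word/segmentation pairs, the "|" separator symbol, and the exact-match accuracy metric are illustrative assumptions, not taken from the paper's datasets or evaluation code.

```python
# Illustrative sketch only (not the paper's implementation).
# Canonical segmentation maps a surface word to its standardized morphemes,
# restoring spelling changes: e.g., "achievability" -> "achieve|able|ity",
# whereas a purely surface segmentation would give "achiev|abil|ity".
# The pairs and the "|" separator below are assumed for illustration.
TRAIN_PAIRS = [
    ("achievability", "achieve|able|ity"),
    ("funniest", "funny|est"),
    ("impossible", "in|possible"),
]

def to_char_sequences(word: str, segmentation: str):
    """Turn one (word, canonical segmentation) pair into the character-level
    source/target sequences a sequence-to-sequence model would be trained on."""
    source = list(word)          # e.g., ['a', 'c', 'h', 'i', ...]
    target = list(segmentation)  # '|' appears as an explicit separator symbol
    return source, target

def exact_match_accuracy(predictions, gold):
    """One common metric for this task (assumed here): a prediction counts as
    correct only if the full canonical segmentation matches the reference."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

if __name__ == "__main__":
    for word, seg in TRAIN_PAIRS:
        src, tgt = to_char_sequences(word, seg)
        print(f"{word!r:18} -> source={''.join(src)!r}, target={''.join(tgt)!r}")

    # Toy evaluation with hypothetical model outputs; one of three is wrong.
    gold = [seg for _, seg in TRAIN_PAIRS]
    predictions = ["achieve|able|ity", "funni|est", "in|possible"]
    print(f"exact-match accuracy: {exact_match_accuracy(predictions, gold):.1%}")
```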