Paper title
Handling Compounding in Mobile Keyboard Input
Paper authors
Paper abstract
This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages. Smartphone keyboards typically support features such as input decoding, correction and prediction, all of which rely on language models. For latency reasons, these operations happen on device, so the models are of limited size and cannot easily cover all the words users need for their daily tasks, especially in morphologically rich languages. In particular, the compounding nature of Germanic languages makes their vocabulary virtually infinite. Similarly, heavily inflecting and agglutinative languages (e.g. Slavic, Turkic or Finno-Ugric languages) tend to have much larger vocabularies than morphologically simpler languages such as English or Mandarin. We propose to model such languages with automatically selected subword units annotated with what we call binding types, which tell the decoder when to bind subword units into words. We show that this method yields a word error rate reduction of around 20% in a variety of compounding languages. This is more than twice the improvement we previously obtained with a more basic approach, which is also described in the paper.
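To make the idea of binding types concrete, the following is a minimal sketch of how a decoder might bind annotated subword units into words. The abstract does not define the binding types, so the four-value `Binding` enum, the `bind_units` function and the joining rule (attach a unit when it binds left or when the previous unit binds right) are illustrative assumptions, not the method described in the paper.

```python
from enum import Enum


class Binding(Enum):
    """Hypothetical binding types; the paper's actual inventory may differ."""
    NONE = 0   # free-standing word
    LEFT = 1   # attaches to the preceding unit
    RIGHT = 2  # attaches to the following unit
    BOTH = 3   # attaches on both sides


def bind_units(units):
    """Join (text, Binding) pairs into a space-separated string of words.

    Assumed rule: a unit is glued to the previous one if it binds left
    or if the previous unit binds right; otherwise it starts a new word.
    """
    words = []
    prev_binds_right = False
    for text, binding in units:
        binds_left = binding in (Binding.LEFT, Binding.BOTH)
        if words and (prev_binds_right or binds_left):
            words[-1] += text      # bind into the current word
        else:
            words.append(text)     # start a new word
        prev_binds_right = binding in (Binding.RIGHT, Binding.BOTH)
    return " ".join(words)


if __name__ == "__main__":
    decoded = [
        ("die", Binding.NONE),
        ("Haus", Binding.RIGHT),
        ("tür", Binding.LEFT),
        ("ist", Binding.NONE),
        ("offen", Binding.NONE),
    ]
    print(bind_units(decoded))
```

Under these assumptions, the German compound "Haustür" is never stored as a vocabulary entry; it is assembled from two units whose annotations tell the decoder not to insert a space between them.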