Paper title

Text normalization for low-resource languages: the case of Ligurian

Authors

Stefano Lusito, Edoardo Ferrante, Jean Maillard

Abstract

Text normalization is a crucial technology for low-resource languages that lack rigid spelling conventions or have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data-efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open-source monolingual corpus for Ligurian. We show that, in spite of the small amount of data available, a compact transformer-based model can be trained to achieve very low error rates through the use of backtranslation and appropriate tokenization.
