Paper Title

Weakly Supervised POS Taggers Perform Poorly on Truly Low-Resource Languages

Authors

Katharina Kann, Ophélie Lacroix, Anders Søgaard

Abstract

Part-of-speech (POS) taggers for low-resource languages which are exclusively based on various forms of weak supervision - e.g., cross-lingual transfer, type-level supervision, or a combination thereof - have been reported to perform almost as well as supervised ones. However, weakly supervised POS taggers are commonly only evaluated on languages that are very different from truly low-resource languages, and the taggers use sources of information, like high-coverage and almost error-free dictionaries, which are likely not available for resource-poor languages. We train and evaluate state-of-the-art weakly supervised POS taggers for a typologically diverse set of 15 truly low-resource languages. On these languages, given a realistic amount of resources, even our best model gets only less than half of the words right. Our results highlight the need for new and different approaches to POS tagging for truly low-resource languages.
