Paper title
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Paper authors
Abstract
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.