Paper Title
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Authors
Abstract
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed both manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.