Paper title
Open Korean Corpora: A Practical Report
Paper authors
Paper abstract
Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it also reflects the fact that available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction for how open-source dataset construction and release should be done for less-resourced languages to promote research.