Paper title
Informal Persian Universal Dependency Treebank
Paper authors
Paper abstract
This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that these two variants have fundamental differences that cannot be attributed solely to pronunciation discrepancies. Given that informal Persian exhibits particular characteristics, any computational model trained on formal Persian is unlikely to transfer well to informal Persian, necessitating the creation of dedicated treebanks for this variety. We thus detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e., the development set of our informal treebank. Our results show that the parsers experience a substantial performance drop when moved across the two domains, as they face more unknown tokens and structures and fail to generalize well. Furthermore, the dependency relations whose performance deteriorates the most represent the unique properties of the informal variant. The ultimate goal of this study, and its broader impact, is to provide a stepping stone toward revealing the significance of informal language variants, which have been widely overlooked by natural language processing tools across languages.
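The out-of-domain evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not name its metrics, so the use of unlabeled and labeled attachment scores (UAS/LAS) and a per-relation breakdown is an assumption here. The function names `make_conllu`, `read_conllu`, and `attachment_scores` are hypothetical, and the three-token sentence is toy data with placeholder tokens, not material from the treebank.

```python
from collections import defaultdict


def make_conllu(rows):
    """Build a one-sentence CoNLL-U string from (form, head, deprel) rows (toy helper)."""
    lines = []
    for i, (form, head, rel) in enumerate(rows, start=1):
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([str(i), form, "_", "_", "_", "_", str(head), rel, "_", "_"]))
    return "\n".join(lines) + "\n"


def read_conllu(text):
    """Parse CoNLL-U text into sentences, each a list of (head, deprel) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold, pred):
    """Return (UAS, LAS, per-relation labeled accuracy keyed by gold relation)."""
    total = uas_hits = las_hits = 0
    per_rel = defaultdict(lambda: [0, 0])  # gold relation -> [correct, total]
    for g_sent, p_sent in zip(gold, pred):
        assert len(g_sent) == len(p_sent), "gold and predicted tokenization must match"
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            head_ok = g_head == p_head
            both_ok = head_ok and g_rel == p_rel
            total += 1
            uas_hits += head_ok
            las_hits += both_ok
            per_rel[g_rel][1] += 1
            per_rel[g_rel][0] += both_ok
    by_relation = {rel: c / t for rel, (c, t) in per_rel.items()}
    return uas_hits / total, las_hits / total, by_relation


# Toy gold and predicted analyses of one placeholder sentence (not real treebank data).
GOLD = make_conllu([("w1", 2, "nsubj"), ("w2", 0, "root"), ("w3", 2, "obj")])
PRED = make_conllu([("w1", 2, "nsubj"), ("w2", 0, "root"), ("w3", 2, "obl")])

uas, las, by_relation = attachment_scores(read_conllu(GOLD), read_conllu(PRED))
print(f"UAS={uas:.3f} LAS={las:.3f}")
for rel, acc in sorted(by_relation.items(), key=lambda kv: kv[1]):
    print(f"{rel}\t{acc:.3f}")
```

Sorting the per-relation scores in ascending order puts the worst-handled relations first, which is one way to see which dependency relations deteriorate most when a parser trained on formal data is applied to informal data.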