Paper Title
Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank
Paper Authors
Paper Abstract
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and target language varieties.
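The two adaptation steps named in the abstract, vocabulary augmentation and additional language-specific pretraining, can be illustrated with a minimal sketch using the Hugging Face transformers library. This is an assumption-laden illustration, not the authors' released code: the added tokens, the toy corpus, and all hyperparameters are placeholders.

```python
# Minimal sketch (assumed setup): adapt multilingual BERT to a low-resource
# variety by (1) adding target-language tokens and (2) continuing masked
# language modeling on a small unlabeled corpus. All names below are
# hypothetical placeholders, not values from the paper.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from pretrained multilingual BERT.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# (1) Vocabulary augmentation: add frequent target-language word forms that
# mBERT's vocabulary would otherwise over-segment (placeholder strings here),
# then resize the embedding matrix so the new tokens get trainable vectors.
new_tokens = ["targetword1", "targetword2"]  # hypothetical tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# (2) Language-specific pretraining: continue masked language modeling on a
# small unlabeled corpus in the target variety (toy in-memory example).
corpus = ["A sentence in the target language.", "Another short sentence."]
encodings = tokenizer(corpus, truncation=True, max_length=128)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="adapted-mbert",          # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
).train()
```

The adapted encoder would then be fine-tuned on the small labeled treebank for dependency parsing; that downstream step is not shown here.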