Paper Title
Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank
Paper Authors
Paper Abstract
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and target language varieties.
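The two adaptation steps named in the abstract, vocabulary augmentation and additional language-specific pretraining, can be illustrated with a minimal sketch using the Hugging Face transformers library. This is an assumption-laden illustration, not the authors' released code: the added tokens, the toy corpus, and all hyperparameters are placeholders.

```python
# Minimal sketch (assumed setup): adapt multilingual BERT to a low-resource
# variety by (1) adding target-language tokens and (2) continuing masked
# language modeling on a small unlabeled corpus. All names below are
# hypothetical placeholders, not values from the paper.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from pretrained multilingual BERT.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# (1) Vocabulary augmentation: add frequent target-language word forms that
# mBERT's vocabulary would otherwise over-segment (placeholder strings here),
# then resize the embedding matrix so the new tokens get trainable vectors.
new_tokens = ["targetword1", "targetword2"]  # hypothetical tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# (2) Language-specific pretraining: continue masked language modeling on a
# small unlabeled corpus in the target variety (toy in-memory example).
corpus = ["A sentence in the target language.", "Another short sentence."]
encodings = tokenizer(corpus, truncation=True, max_length=128)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="adapted-mbert",          # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
).train()
```

The adapted encoder would then be fine-tuned on the small labeled treebank for dependency parsing; that downstream step is not shown here.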