Paper Title

Probing Multilingual BERT for Genetic and Typological Signals

Authors

Taraka Rama, Lisa Beinborn, Steffen Eger

Abstract

We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages and compute language distances based on the mBERT representations. We 1) employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance, 2) perform distance matrix regression analysis, finding that the language distances can be best explained by phylogenetic and worst by structural factors and 3) present a novel measure for measuring diachronic meaning stability (based on cross-lingual representation variability) which correlates significantly with published ranked lists based on linguistic approaches. Our results contribute to the nascent field of typological interpretability of cross-lingual text representations.
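The first step in the abstract, turning per-language representations into language distances and then into a tree, can be illustrated with a minimal sketch. The abstract does not name the distance measure or the tree-inference algorithm, so cosine distance and average-linkage (UPGMA) clustering are assumptions here, and the `TOY` vectors are hypothetical stand-ins for mean-pooled mBERT representations, not data from the paper.

```python
import math
from itertools import combinations

# Hypothetical per-language vectors (stand-ins for pooled mBERT layer outputs).
TOY = {
    "de": [1.0, 0.1, 0.0],
    "nl": [0.9, 0.2, 0.0],
    "es": [0.1, 1.0, 0.1],
    "it": [0.0, 0.9, 0.2],
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(lang_vectors):
    """Build a symmetric language-by-language distance lookup."""
    dist = {}
    for a, b in combinations(sorted(lang_vectors), 2):
        d = cosine_distance(lang_vectors[a], lang_vectors[b])
        dist[(a, b)] = d
        dist[(b, a)] = d
    return dist


def upgma(lang_vectors):
    """Average-linkage clustering (an assumed choice); returns nested tuples."""
    dist = distance_matrix(lang_vectors)
    # Map each cluster's member tuple to the subtree built so far.
    clusters = {(lang,): lang for lang in lang_vectors}

    def average_distance(c1, c2):
        total = sum(dist[(a, b)] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        c1, c2 = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_distance(*pair),
        )
        merged = (clusters.pop(c1), clusters.pop(c2))
        clusters[tuple(sorted(c1 + c2))] = merged
    return next(iter(clusters.values()))


if __name__ == "__main__":
    print(upgma(TOY))
```

On the toy vectors, the two Germanic-like and the two Romance-like points are merged first. An inferred tree of this kind is what the abstract's quartet tree distance would then compare against a reference family tree.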
