Paper Title

Probing Pretrained Language Models for Lexical Semantics

Authors

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, Anna Korhonen

Abstract

The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to what extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the following questions: 1) How do different lexical knowledge extraction strategies (monolingual versus multilingual source LM, out-of-context versus in-context encoding, inclusion of special tokens, and layer-wise averaging) impact performance? How consistent are the observed effects across tasks and languages? 2) Is lexical knowledge stored in a few parameters, or is it scattered throughout the network? 3) How do these representations fare against traditional static word vectors in lexical tasks? 4) Does the lexical information emerging from independently trained monolingual LMs display latent similarities? Our main results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks. Moreover, we validate the claim that lower Transformer layers carry more type-level lexical knowledge, but also show that this knowledge is distributed across multiple layers.
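The abstract enumerates the extraction knobs the paper compares: out-of-context vs. in-context encoding, inclusion of special tokens, and layer-wise averaging. The sketch below illustrates these using the Hugging Face `transformers` library; it is a minimal illustration, not the authors' implementation, and the model name, layer range, and subword-matching heuristic are assumptions made for the example.

```python
# Minimal sketch of type-level word-vector extraction from a pretrained LM.
# Assumptions (not from the paper's code): bert-base-multilingual-cased as the
# source LM, layers 0-6 as the "lower" layers, and a simple heuristic that
# averages all subword positions matching the target word's WordPiece tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained(
    "bert-base-multilingual-cased", output_hidden_states=True
)
model.eval()

def word_vector(word, context=None, layers=range(0, 7), keep_special=False):
    """Type-level vector for `word`: mean over subwords and selected layers."""
    # Out-of-context encoding feeds the word alone; in-context encoding feeds
    # the word inside a sentence.
    text = context if context is not None else word
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # hidden is a tuple: embedding output + one entry per Transformer layer.
        hidden = model(**enc).hidden_states

    # Simplified matching: collect positions whose token belongs to the word's
    # WordPiece segmentation (all occurrences are averaged together).
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    word_pieces = tokenizer.tokenize(word)
    positions = [i for i, t in enumerate(tokens) if t in word_pieces]
    if keep_special:
        # Optionally include [CLS]/[SEP] representations in the average.
        positions += [i for i, t in enumerate(tokens)
                      if t in ("[CLS]", "[SEP]")]

    # Layer-wise averaging over the selected layers, then over subwords.
    stacked = torch.stack([hidden[l][0] for l in layers]).mean(dim=0)
    return stacked[positions].mean(dim=0)

# Out-of-context vs. in-context encodings of the same word:
v_iso = word_vector("bank")
v_ctx = word_vector("bank", context="She sat on the river bank.")
sim = torch.nn.functional.cosine_similarity(v_iso, v_ctx, dim=0)
print(f"cosine(iso, ctx) = {sim.item():.3f}")
```

Averaging only the lower layers in this sketch mirrors the paper's finding that type-level lexical knowledge is concentrated in lower Transformer layers while still being distributed across several of them.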
