Paper Title

Model Choices Influence Attributive Word Associations: A Semi-supervised Analysis of Static Word Embeddings

Authors

Geetanjali Bihani, Julia Taylor Rayz

Abstract

Static word embeddings encode word associations, extensively utilized in downstream NLP tasks. Although prior studies have discussed the nature of such word associations in terms of biases and lexical regularities captured, the variation in word associations based on the embedding training procedure remains in obscurity. This work aims to address this gap by assessing attributive word associations across five different static word embedding architectures, analyzing the impact of the choice of the model architecture, context learning flavor and training corpora. Our approach utilizes a semi-supervised clustering method to cluster annotated proper nouns and adjectives, based on their word embedding features, revealing underlying attributive word associations formed in the embedding space, without introducing any confirmation bias. Our results reveal that the choice of the context learning flavor during embedding training (CBOW vs skip-gram) impacts the word association distinguishability and word embeddings' sensitivity to deviations in the training corpora. Moreover, it is empirically shown that even when trained over the same corpora, there is significant inter-model disparity and intra-model similarity in the encoded word associations across different word embedding models, portraying specific patterns in the way the embedding space is created for each embedding architecture.
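The abstract does not say which semi-supervised clustering algorithm the authors use. As a rough illustration of the general idea only, the sketch below uses seeded k-means, a stand-in technique chosen here and not confirmed as the paper's method. A few annotated words fix the cluster identities, and the remaining words are assigned by proximity in the embedding space. The function name `seeded_kmeans`, the toy 2-D vectors, and the seed labels are all hypothetical.

```python
import numpy as np


def seeded_kmeans(vectors, seed_labels, n_clusters, n_iter=20):
    """Cluster embedding vectors, keeping annotated (seed) points fixed.

    vectors:     (n, d) array-like of word embeddings.
    seed_labels: length-n integer array-like; cluster id for annotated
                 words, -1 for unannotated ones. Every cluster needs at
                 least one seed.
    NOTE: an illustrative stand-in, not the paper's confirmed method.
    """
    vectors = np.asarray(vectors, dtype=float)
    seed_labels = np.asarray(seed_labels)
    seeded = seed_labels >= 0

    # Length-normalise so Euclidean distance tracks cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)

    # Initialise each centroid as the mean of its annotated seeds.
    centroids = np.stack(
        [unit[seed_labels == k].mean(axis=0) for k in range(n_clusters)]
    )
    labels = seed_labels.copy()

    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(unit[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Annotated points always keep their given label.
        new_labels[seeded] = seed_labels[seeded]
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute centroids from the current assignment.
        centroids = np.stack(
            [unit[labels == k].mean(axis=0) for k in range(n_clusters)]
        )
    return labels


# Hypothetical toy "embeddings": two rough directions in 2-D space.
toy_vectors = [
    [1.0, 0.1], [0.9, 0.2], [0.1, 1.0],
    [0.2, 0.9], [0.8, 0.1], [0.0, 0.7],
]
# Only the first and third words are annotated.
toy_seeds = [0, -1, 1, -1, -1, -1]
toy_labels = seeded_kmeans(toy_vectors, toy_seeds, n_clusters=2)
print(toy_labels.tolist())
```

To compare embedding models in the spirit of the abstract, one could run the same seeds over vectors from different models (for example CBOW and skip-gram variants) and compare the resulting assignments. That comparison step is likewise an assumption here, not a description of the authors' procedure.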
