Paper Title

HistBERT: A Pre-trained Language Model for Diachronic Lexical Semantic Analysis

Authors

Wenjun Qiu, Yang Xu

Abstract

Contextualized word embeddings have demonstrated state-of-the-art performance in various natural language processing tasks, including those that concern historical semantic change. However, language models such as BERT were trained primarily on contemporary corpus data. To investigate whether training on historical corpus data improves diachronic semantic analysis, we present a pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus of Historical American English. We examine the effectiveness of our approach by comparing the performance of the original BERT with that of HistBERT, and we report promising results in word similarity and semantic shift analysis. Our work suggests that the effectiveness of contextual embeddings in diachronic semantic analysis depends on the temporal profile of the input text, and that care should be taken in applying this methodology to study historical semantic change.
