Paper Title

Finding Inverse Document Frequency Information in BERT

Authors

Jaekeol Choi, Euna Jung, Sungjun Lim, Wonjong Rhee

Abstract

For many decades, BM25 and its variants have been the dominant document retrieval approach, where their two underlying features are Term Frequency (TF) and Inverse Document Frequency (IDF). The traditional approach, however, is being rapidly replaced by Neural Ranking Models (NRMs) that can exploit semantic features. In this work, we consider BERT-based NRMs and study if IDF information is present in the NRMs. This simple question is interesting because IDF has been indispensable for the traditional lexical matching, but global features like IDF are not explicitly learned by neural language models including BERT. We adopt linear probing as the main analysis tool because typical BERT based NRMs utilize linear or inner-product based score aggregators. We analyze input embeddings, representations of all BERT layers, and the self-attention weights of CLS. By studying MS-MARCO dataset with three BERT-based models, we show that all of them contain information that is strongly dependent on IDF.
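As background, the two ideas the abstract leans on can be illustrated briefly: computing IDF (here the common smoothed form log(N / df), one of several variants used inside BM25) and linear probing (regressing a target on representations and checking how much variance a linear map explains). This is a minimal sketch, not the paper's actual setup: the function name `idf` and the random features standing in for BERT-layer representations are our own illustrative choices.

```python
import math
from collections import Counter

import numpy as np

def idf(corpus):
    """IDF with the common smoothed form log(N / df(t))."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))  # df counts documents, not tokens
    return {t: math.log(n / d) for t, d in df.items()}

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird flew over the house",
]
scores = idf(docs)
# "the" occurs in all 3 documents -> idf = log(3/3) = 0
# "bird" occurs in 1 document    -> idf = log(3/1) ~ 1.10

# Linear probing: fit a least-squares map from representations to a
# target and measure R^2. Random features stand in here for real
# BERT-layer representations; the target is a synthetic IDF-like value.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                # 200 "tokens" x 16 dims
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=200)

w, *_ = np.linalg.lstsq(X, y, rcond=None)     # the linear probe
r2 = 1.0 - np.sum((y - X @ w) ** 2) / np.sum((y - y.mean()) ** 2)
# a high R^2 means the target is linearly decodable from the features
```

A high R² on real representations is the kind of evidence the paper's probing analysis looks for: if IDF can be read out by a linear map, the information is present in a form that linear or inner-product score aggregators can use.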
