Paper Title
InforMask: Unsupervised Informative Masking for Language Model Pretraining
Paper Authors
Paper Abstract
Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal because it assigns an equal masking rate to all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.
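As a rough illustration of the PMI-based selection described in the abstract, the following is a minimal Python sketch that scores each token in a sentence by its summed PMI with the other tokens (using sentence-level co-occurrence counts) and masks the highest-scoring positions. The counting scheme, the summed-PMI score, and the `informative_mask` helper are illustrative assumptions for this sketch; they are not the paper's exact procedure or its two efficiency optimizations.

```python
import math
from collections import Counter
from itertools import combinations

def build_counts(corpus):
    """Sentence-level occurrence and co-occurrence counts over a tokenized corpus."""
    occ, cooc = Counter(), Counter()
    for sent in corpus:
        vocab = set(sent)
        occ.update(vocab)
        cooc.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    return occ, cooc, len(corpus)

def pmi(w1, w2, occ, cooc, n_sents):
    """PMI under sentence-level co-occurrence probabilities (0 if the pair is unseen)."""
    joint = cooc[frozenset((w1, w2))]
    if joint == 0 or w1 == w2:
        return 0.0
    return math.log((joint / n_sents) / ((occ[w1] / n_sents) * (occ[w2] / n_sents)))

def informative_mask(sentence, occ, cooc, n_sents, mask_rate=0.15):
    """Rank tokens by summed PMI with the rest of the sentence; return the
    indices of the top-scoring tokens as mask positions (assumed scoring)."""
    scored = []
    for i, tok in enumerate(sentence):
        score = sum(pmi(tok, other, occ, cooc, n_sents)
                    for j, other in enumerate(sentence) if j != i)
        scored.append((score, i))
    n_mask = max(1, round(mask_rate * len(sentence)))
    return sorted(i for _, i in sorted(scored, reverse=True)[:n_mask])

# Toy usage: tokens that co-occur in informative pairs (e.g. "eiffel"/"paris")
# score higher than frequent function words and are chosen for masking.
corpus = [["the", "eiffel", "tower", "is", "in", "paris"],
          ["the", "louvre", "is", "in", "paris"],
          ["the", "cat", "sat", "on", "the", "mat"]]
occ, cooc, n_sents = build_counts(corpus)
print(informative_mask(corpus[0], occ, cooc, n_sents))
```

In this toy setting, content tokens with strong co-occurrence statistics tend to be selected over function words, which is the intuition behind preferring informative tokens as mask targets rather than masking uniformly at random.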