Paper Title
MarkBERT: Marking Word Boundaries Improves Chinese BERT
Paper Authors
Paper Abstract
In this work, we present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units; however, due to the vocabulary limit of BERT, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Different from existing work, MarkBERT keeps the vocabulary at the Chinese character level and inserts boundary markers between contiguous words. This design enables the model to handle any word in the same way, whether or not it is an OOV word. Our model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which are complementary to traditional character- and sentence-level pretraining tasks; second, it can easily incorporate richer semantics, such as the POS tags of words, by replacing generic markers with POS-tag-specific markers. With this simple marker insertion, MarkBERT improves performance on various downstream tasks, including language understanding and sequence labeling. \footnote{All code and models will be made publicly available at \url{https://github.com/daiyongya/markbert}}
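To make the marker-insertion idea concrete, here is a minimal Python sketch of the preprocessing step the abstract describes, assuming the input sentence has already been segmented by an external Chinese word segmenter. The marker symbols ("[unused1]" and the POS-specific "[unused_<TAG>]" forms) are illustrative placeholders, not the exact tokens used in the paper, and the helper names are hypothetical.

```python
# A sketch of MarkBERT-style boundary-marker insertion, assuming
# pre-segmented input and a character-level BERT vocabulary.
from typing import List, Tuple

GENERIC_MARKER = "[unused1]"  # placeholder boundary-marker token


def insert_markers(words: List[str]) -> List[str]:
    """Split each word into characters and insert a boundary marker
    between contiguous words; characters remain the basic units."""
    tokens: List[str] = []
    for i, word in enumerate(words):
        tokens.extend(list(word))          # word -> individual characters
        if i < len(words) - 1:
            tokens.append(GENERIC_MARKER)  # marker between adjacent words
    return tokens


def insert_pos_markers(tagged: List[Tuple[str, str]]) -> List[str]:
    """Variant with POS-tag-specific markers: the marker following a
    word carries that word's POS tag (hypothetical token format)."""
    tokens: List[str] = []
    for i, (word, pos) in enumerate(tagged):
        tokens.extend(list(word))
        if i < len(tagged) - 1:
            tokens.append(f"[unused_{pos}]")
    return tokens


# Example: "北京欢迎你" segmented as ["北京", "欢迎", "你"]
print(insert_markers(["北京", "欢迎", "你"]))
# -> ['北', '京', '[unused1]', '欢', '迎', '[unused1]', '你']
print(insert_pos_markers([("北京", "NR"), ("欢迎", "VV"), ("你", "PN")]))
# -> ['北', '京', '[unused_NR]', '欢', '迎', '[unused_VV]', '你']
```

Because the vocabulary stays character-level, an OOV word simply becomes its constituent characters plus a trailing marker, so no special fallback path is needed; the marker positions are also natural anchors for the word-level learning objectives mentioned above.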