Paper Title

Searching for Optimal Subword Tokenization in Cross-domain NER

Authors

Ruotian Ma, Yiding Tan, Xin Zhou, Xuanting Chen, Di Liang, Sirui Wang, Wei Wu, Tao Gui, Qi Zhang

Abstract

Input distribution shift is one of the vital problems in unsupervised domain adaptation (UDA). The most popular UDA approaches focus on domain-invariant representation learning, trying to align the features from different domains into similar feature distributions. However, these approaches ignore the direct alignment of input word distributions between domains, which is a vital factor in word-level classification tasks such as cross-domain NER. In this work, we shed new light on cross-domain NER by introducing a subword-level solution, X-Piece, for input word-level distribution shift in NER. Specifically, we re-tokenize the input words of the source domain to approach the target subword distribution, which is formulated and solved as an optimal transport problem. As this approach focuses on the input level, it can also be combined with previous DIRL methods for further improvement. Experimental results show the effectiveness of the proposed method based on BERT-tagger on four benchmark NER datasets. Also, the proposed method is proved to benefit DIRL methods such as DANN.
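The core idea above, aligning a source subword distribution to a target one via optimal transport, can be illustrated with a small sketch. This is not the authors' X-Piece implementation; the toy distributions, the cost matrix, and the plain Sinkhorn solver are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy frequencies of three candidate subword tokenizations,
# in the source domain vs. the target domain (not from the paper's data).
source = np.array([0.5, 0.3, 0.2])   # p(subword | source)
target = np.array([0.2, 0.3, 0.5])   # p(subword | target)

# Toy cost of re-assigning mass between tokenization choices
# (zero on the diagonal, uniform cost otherwise).
cost = np.ones((3, 3)) - np.eye(3)

def sinkhorn(p, q, C, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations."""
    K = np.exp(-C / reg)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

plan = sinkhorn(source, target, cost)
# The plan's row sums recover the source marginal, its column sums the
# target marginal, i.e. it tells us how much source-side tokenization
# mass to re-route toward each target-side tokenization.
print(plan.round(3))
```

The transport plan would then guide how often each source word is re-tokenized into each candidate subword sequence so that the resulting source subword distribution approaches the target one.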
