Title
LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition
Authors
Abstract
In recent years, deep learning has achieved great success in many natural language processing tasks, including named entity recognition. The drawback is that it usually requires a large amount of manually annotated data. Previous studies have demonstrated that active learning can considerably reduce the cost of data annotation, but there is still plenty of room for improvement. In real applications, we found that existing uncertainty-based active learning strategies have two shortcomings. First, these strategies prefer, explicitly or implicitly, to choose long sequences, which increases the annotation burden on annotators. Second, some strategies require invading the model and modifying it to generate additional information for sample selection, which increases the developer's workload as well as the model's training/prediction time. In this paper, we first examine traditional active learning strategies in the specific case of BiLSTM-CRF, which has been widely used for named entity recognition, on several typical datasets. We then propose an uncertainty-based active learning strategy called Lowest Token Probability (LTP), which combines the input and output of the CRF to select informative instances. LTP is a simple and powerful strategy that does not favor long sequences and does not need to invade the model. We test LTP on multiple datasets, and the experiments show that LTP performs slightly better than traditional strategies in both sentence-level accuracy and entity-level F1-score while requiring noticeably fewer annotated tokens. Related code has been released at https://github.com/HIT-ICES/AL-NER.
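Below is a minimal sketch of the selection rule the abstract describes, assuming per-token marginal probabilities from a trained CRF (e.g., via forward-backward) and Viterbi-decoded tags are already available. The names `ltp_score`, `select_batch`, `token_marginals`, and `pred_tags` are illustrative and not taken from the released implementation.

```python
import numpy as np

def ltp_score(token_marginals: np.ndarray, pred_tags: np.ndarray) -> float:
    """Score one sentence by its lowest token probability.

    token_marginals: (seq_len, num_tags) CRF marginal probabilities,
        assumed to come from forward-backward on a trained model.
    pred_tags: (seq_len,) index of the Viterbi-decoded tag per token.
    Returns the smallest probability the model assigns to any predicted
    tag in the sentence; lower values indicate higher uncertainty.
    """
    probs = token_marginals[np.arange(len(pred_tags)), pred_tags]
    return float(probs.min())

def select_batch(unlabeled, marginals, predictions, k):
    """Pick the k unlabeled sentences with the lowest LTP scores."""
    scores = [ltp_score(m, p) for m, p in zip(marginals, predictions)]
    order = np.argsort(scores)  # ascending: most uncertain first
    return [unlabeled[i] for i in order[:k]]
```

Because the score is a minimum over tokens rather than a sum or product over the sequence, it does not systematically shrink as sentences get longer, which is consistent with the abstract's claim that LTP does not favor long sequences.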