Paper Title

E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text

Authors

Au, Ting Wai Terence; Cox, Ingemar J.; Lampos, Vasileios

Abstract

Identifying named entities, such as a person, location or organization, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, and creating one can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently, there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.
