CinPatent: Datasets for Patent Classification

Paper title

CinPatent: Datasets for Patent Classification

Authors

Nguyen, Minh-Tien; Bui, Nhung; Tran-Tien, Manh; Le, Linh; Vu, Huy-The

Abstract

Patent classification is the task of assigning each input patent to several codes (classes). Because demand for it is high, several datasets and methods have been introduced. However, the lack of a systematic performance comparison of baselines, together with limited access to some datasets, leaves a gap for the task. To fill this gap, we introduce two new datasets, in English and Japanese, collected using CPC codes. The English dataset includes 45,131 patent documents with 425 labels, and the Japanese dataset contains 54,657 documents with 523 labels. To facilitate future studies, we compare the performance of strong multi-label text classification methods on the two datasets. Experimental results show that AttentionXML is consistently better than the other strong baselines. An ablation study is also conducted on two aspects: the contribution of different parts of a patent (title, abstract, description, and claims) and the behavior of the baselines in terms of performance under different training data segmentation. We release the two new datasets along with the code of the baselines.
