Patents Phrase to Phrase Semantic Matching Dataset

Paper Title

Patents Phrase to Phrase Semantic Matching Dataset

Authors

Grigor Aslanyan, Ian Wetherbee

Abstract

There are many general-purpose benchmark datasets for Semantic Textual Similarity, but none of them focuses on the technical concepts found in patents and scientific publications. This work aims to fill this gap by presenting a new human-rated contextual phrase-to-phrase matching dataset. The entire dataset contains close to 50,000 rated phrase pairs, each with a CPC (Cooperative Patent Classification) class as context. This paper describes the dataset and some baseline models.
