SciNLI: Natural Language Inference on Scientific Text

Paper Title

SciNLI: A Corpus for Natural Language Inference on Scientific Text

Authors

Mobashir Sadat, Cornelia Caragea

Abstract

Existing Natural Language Inference (NLI) datasets, while instrumental in the advancement of Natural Language Understanding (NLU) research, are not related to scientific text. In this paper, we introduce SciNLI, a large dataset for NLI that captures the formality of scientific text and contains 107,412 sentence pairs extracted from scholarly papers on NLP and computational linguistics. Given that the text used in scientific literature differs vastly from the text used in everyday language, both in vocabulary and in sentence structure, our dataset is well suited to serve as a benchmark for the evaluation of scientific NLU models. Our experiments show that SciNLI is harder to classify than the existing NLI datasets. Our best-performing model, based on XLNet, achieves a Macro F1 score of only 78.18% and an accuracy of 78.23%, showing that there is substantial room for improvement.
