PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

Paper Title

PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

Authors

Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, Tal Schuster

Abstract

The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e. whether the entirety of its meaning can be inferred from the other. In current NLI datasets and models, textual entailment relations are typically defined on the sentence- or paragraph-level. However, even a simple sentence often contains multiple propositions, i.e. distinct units of meaning conveyed by the sentence. As these propositions can carry different truth values in the context of a given premise, we argue for the need to recognize the textual entailment relation of each proposition in a sentence individually. We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters. Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity. We establish strong baselines for the segmentation and entailment tasks. Through case studies on summary hallucination detection and document-level NLI, we demonstrate that our conceptual framework is potentially useful for understanding and explaining the compositionality of NLI labels.
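
The idea that one sentence carries several propositions, each with its own entailment label, can be illustrated with a small Python sketch. This is not the authors' code or the dataset's actual schema: the names Label, Proposition, and compose_sentence_label are hypothetical, the example sentence is invented, and the composition rule (any contradiction makes the sentence contradicted, all entailed makes it entailed, anything else is neutral) is a common simplifying assumption rather than something the abstract states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Label(Enum):
    """Three-way NLI label set (assumed for illustration)."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass
class Proposition:
    """One unit of meaning from a sentence, labeled against a premise document."""
    text: str
    label: Label


def compose_sentence_label(propositions: List[Proposition]) -> Label:
    """Derive a sentence-level label from proposition-level labels.

    Assumed rule: a single contradicted proposition contradicts the sentence,
    the sentence is entailed only if every proposition is entailed,
    and any other mix is neutral.
    """
    if not propositions:
        raise ValueError("a sentence needs at least one proposition")
    labels = [p.label for p in propositions]
    if Label.CONTRADICTION in labels:
        return Label.CONTRADICTION
    if all(label is Label.ENTAILMENT for label in labels):
        return Label.ENTAILMENT
    return Label.NEUTRAL


# Invented sentence: "Alice founded the company in Boston in 2010."
sentence_props = [
    Proposition("Alice founded the company.", Label.ENTAILMENT),
    Proposition("The company was founded in 2010.", Label.ENTAILMENT),
    Proposition("The company was founded in Boston.", Label.NEUTRAL),
]

sentence_label = compose_sentence_label(sentence_props)
print(sentence_label.value)
```

In this sketch the sentence as a whole comes out neutral even though two of its three propositions are entailed, which is the kind of detail a single sentence-level label hides.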
