Paper title
BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
Paper authors
Paper abstract
The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) unlike datasets containing generated summaries with multiple errors, BUMP enables the measurement of metrics' performance on individual error types.
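To make the notion of consistency concrete, the following is a minimal sketch of how a pair-based check could be computed. It assumes consistency means the share of minimal pairs in which a metric scores the faithful summary strictly higher than its unfaithful counterpart; the exact definition used for BUMP may differ. The names `consistency`, `token_overlap`, and `toy_pairs` are hypothetical, the overlap metric is only a stand-in for a real faithfulness metric, and the two example pairs are invented for illustration rather than taken from the dataset.

```python
from typing import Callable, Sequence, Tuple

# (source document, faithful summary, unfaithful summary)
Pair = Tuple[str, str, str]


def consistency(metric: Callable[[str, str], float], pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the metric scores the faithful summary strictly higher.

    Assumed definition for illustration; ties count as failures.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for doc, good, bad in pairs if metric(doc, good) > metric(doc, bad))
    return wins / len(pairs)


def token_overlap(document: str, summary: str) -> float:
    """Toy stand-in metric: share of summary tokens that also occur in the document."""
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(t in doc_tokens for t in summary_tokens) / len(summary_tokens)


# Invented minimal pairs: each unfaithful summary differs by a single edit.
toy_pairs = [
    # Entity swap: "governor" does not appear in the document.
    ("the mayor opened the new bridge on monday",
     "the mayor opened the bridge",
     "the governor opened the bridge"),
    # Role swap: every token still appears in the document.
    ("police arrested two suspects after the robbery",
     "police arrested two suspects",
     "two suspects arrested police"),
]

score = consistency(token_overlap, toy_pairs)
print(f"consistency of token_overlap on toy pairs: {score:.2f}")
```

On these toy pairs the overlap metric catches the entity swap but not the role swap, since reordering words leaves the overlap unchanged. This is the kind of per-error-type behavior that a minimal-pair design can expose.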