Paper Title


Benchmarking Long-tail Generalization with Likelihood Splits

Authors

Ameya Godbole, Robin Jia

Abstract


In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BoolQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
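The abstract describes the splitting procedure at a high level: score each example with a pre-trained LM and assign the least likely examples to the test set. Below is a minimal sketch of that idea, assuming a GPT-2 scorer and a 20% test fraction; both choices are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of a likelihood-based re-split: score each example's text with a
# pre-trained LM, then put the lowest-likelihood examples in the test set
# and the remaining, more likely examples in the training set.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_likelihood(text: str) -> float:
    """Approximate total log-likelihood of `text` under the LM (higher = more likely)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean per-token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    num_tokens = inputs["input_ids"].shape[1]
    return -loss.item() * num_tokens

def likelihood_split(examples, test_fraction=0.2):
    """Sort examples by LM likelihood; the least likely ones form the test set."""
    scored = sorted(examples, key=lambda ex: sequence_log_likelihood(ex["text"]))
    n_test = int(len(scored) * test_fraction)
    test_set, train_set = scored[:n_test], scored[n_test:]
    return train_set, test_set

# Usage (hypothetical data): train, test = likelihood_split([{"text": "..."}, ...])
```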
