Paper Title
With Little Power Comes Great Responsibility
Paper Authors
Abstract
Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
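To make the abstract's notion of power concrete, here is a minimal simulation-based power analysis for a paired accuracy comparison between two models on a shared test set, of the kind the paper discusses for GLUE-style benchmarks. This is an illustrative sketch, not the authors' released notebooks: the function names are my own, the test used is an exact McNemar test on discordant items, and it makes the simplifying assumption that the two models' errors are independent (real models tend to agree, which changes the number of discordant pairs and hence the power).

```python
import math
import random

def mcnemar_p(n01, n10):
    """Exact two-sided McNemar p-value: under the null hypothesis,
    n01 (items only the new model gets right) follows
    Binomial(n01 + n10, 0.5)."""
    k = n01 + n10
    if k == 0:
        return 1.0
    x = max(n01, n10)
    tail = sum(math.comb(k, i) for i in range(x, k + 1)) / 2 ** k
    return min(1.0, 2 * tail)

def estimate_power(n_items, acc_base, effect, alpha=0.05,
                   n_sims=500, seed=0):
    """Monte Carlo estimate of power: the fraction of simulated
    experiments in which the test rejects at level alpha, given a
    true accuracy difference of `effect` on `n_items` test items.
    Simplifying assumption: per-item correctness of the two models
    is independent."""
    rng = random.Random(seed)
    acc_new = acc_base + effect
    rejections = 0
    for _ in range(n_sims):
        n01 = n10 = 0
        for _ in range(n_items):
            b = rng.random() < acc_base   # baseline correct on this item?
            m = rng.random() < acc_new    # new model correct on this item?
            if m and not b:
                n01 += 1                  # discordant: new model wins
            elif b and not m:
                n10 += 1                  # discordant: baseline wins
        if mcnemar_p(n01, n10) < alpha:
            rejections += 1
    return rejections / n_sims
```

Run before collecting results, varying `n_items` and `effect`, to see whether a planned comparison has a reasonable chance of detecting the improvement you expect; when the estimate is low (e.g. under 0.8), the experiment is underpowered in the abstract's sense and a null result is uninformative.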