Paper Title
On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Paper Authors
Paper Abstract
Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.
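To make concrete the kind of single-word correlation statistic the abstract alludes to, the sketch below implements a simplified version of the test proposed by Gardner et al. (2021): it flags words whose conditional label distribution deviates significantly from the dataset's marginal label distribution. This is an illustrative approximation, not the paper's own code; the function name flag_spurious_words, the binary-label assumption, and the default thresholds are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict

def flag_spurious_words(examples, z_threshold=3.0, min_count=20):
    """Flag single words whose label distribution deviates from the
    marginal label distribution (a simplified variant of the z-test
    in Gardner et al., 2021).

    `examples` is a list of (tokens, label) pairs with binary labels
    (0/1); both labels are assumed to occur in the data.
    """
    # Marginal probability of label 1 over the whole dataset.
    p = sum(label for _, label in examples) / len(examples)

    # Per-word label counts; each word is counted once per example.
    counts = defaultdict(Counter)
    for tokens, label in examples:
        for word in set(tokens):
            counts[word][label] += 1

    flagged = []
    for word, c in counts.items():
        n = c[0] + c[1]
        if n < min_count:
            continue  # too rare for a reliable estimate
        p_hat = c[1] / n  # P(label = 1 | word appears)
        # z-statistic under the null hypothesis that the word
        # is uncorrelated with the label.
        z = (p_hat - p) / math.sqrt(p * (1 - p) / n)
        if abs(z) > z_threshold:
            flagged.append((word, round(z, 2)))
    return sorted(flagged, key=lambda wz: -abs(wz[1]))
```

Words flagged this way would then be the targets of balancing, i.e., adding or filtering instances until no single word predicts the label. The paper's central claim is that even when this procedure succeeds for all single-word features, sufficiently powerful models can still exploit higher-order spurious correlations (word pairs, phrasings, and other feature combinations), and that the procedure risks removing genuine common-sense signal along with the artifacts.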