Paper Title


Why does Throwing Away Data Improve Worst-Group Error?

Authors

Kamalika Chaudhuri, Kartik Ahuja, Martin Arjovsky, David Lopez-Paz

Abstract


When facing data with imbalanced classes or groups, practitioners follow an intriguing strategy to achieve the best results. They throw away examples until the classes or groups are balanced in size, and then perform empirical risk minimization on the reduced training set. This opposes common wisdom in learning theory, where the expected error is supposed to decrease as the dataset grows in size. In this work, we leverage extreme value theory to address this apparent contradiction. Our results show that the tails of the data distribution play an important role in determining the worst-group accuracy of linear classifiers. When learning on data with heavy tails, throwing away data restores the geometric symmetry of the resulting classifier, and therefore improves its worst-group generalization.
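The balancing strategy the abstract describes (discard examples at random until every class or group matches the smallest one, then run ERM on what remains) can be sketched as follows. This is a minimal illustration, not code from the paper; the `subsample_to_balance` name and the `group_of` helper are hypothetical.

```python
import random
from collections import defaultdict

def subsample_to_balance(examples, group_of, seed=0):
    """Group-balanced subsampling: randomly discard examples until every
    group has as many examples as the smallest group.

    `group_of` maps an example to its group label (illustrative helper).
    The returned subset would then be passed to an ordinary ERM trainer.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for ex in examples:
        by_group[group_of(ex)].append(ex)
    # Size of the smallest group sets the per-group budget.
    n_min = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced
```

For example, with 10 examples in group 0 and 3 in group 1, the function returns a shuffled subset of 6 examples, 3 per group.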
