Paper Title

WILDS: A Benchmark of in-the-Wild Distribution Shifts

Paper Authors

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang

Paper Abstract

Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in the real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
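
As a pointer for getting started, the sketch below uses the open-source package's documented loading interface (get_dataset, get_train_loader). It is a minimal sketch, not the paper's prescribed training setup: the dataset choice (camelyon17) and the image transforms are illustrative assumptions, and it assumes the package is installed via pip install wilds along with torchvision.

    # Minimal sketch: load one WILDS dataset and iterate over its official train split.
    # The dataset name and transforms below are illustrative choices, not defaults.
    import torchvision.transforms as transforms
    from wilds import get_dataset
    from wilds.common.data_loaders import get_train_loader

    # Download and load the dataset (e.g., Camelyon17 for tumor identification).
    dataset = get_dataset(dataset="camelyon17", download=True)

    # Take the official in-distribution training split with a simple image transform.
    train_data = dataset.get_subset(
        "train",
        transform=transforms.Compose(
            [transforms.Resize((224, 224)), transforms.ToTensor()]
        ),
    )

    # A standard (ERM-style) loader; the package also provides grouped loaders
    # for methods that use group annotations.
    train_loader = get_train_loader("standard", train_data, batch_size=16)

    for x, y_true, metadata in train_loader:
        pass  # training step goes here

Held-out splits are loaded the same way, and the dataset object exposes a standardized evaluation routine over predictions and metadata, which is how the package keeps metrics comparable across datasets and methods.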
