Paper Title
Evaluation Gaps in Machine Learning Practice
Paper Authors
Paper Abstract
Forming a reliable judgement of a machine learning (ML) model's appropriateness for an application ecosystem is critical for its responsible use, and requires considering a broad range of factors including harms, benefits, and responsibilities. In practice, however, evaluations of ML models frequently focus on only a narrow range of decontextualized predictive behaviours. We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations. Through an empirical study of papers from recent high-profile conferences in the Computer Vision and Natural Language Processing communities, we demonstrate a general focus on a handful of evaluation methods. By considering the metrics and test data distributions used in these methods, we draw attention to which properties of models are centered in the field, revealing the properties that are frequently neglected or sidelined during evaluation. By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the limited role of model inputs in evaluation, and the equivalence of different failure modes. Shedding light on these assumptions enables us to question their appropriateness for ML system contexts, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.
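To make one of these implicit commitments concrete, the assumed "equivalence of different failure modes", here is a minimal, hypothetical Python sketch (not from the paper; all counts are fabricated for illustration) showing how a single aggregate accuracy metric can mask very different error profiles:

```python
from collections import Counter

# (true_label, predicted_label) pairs for two hypothetical binary classifiers
# evaluated on the same 100-example test set; all counts are fabricated.
model_a = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10  # errors split evenly
model_b = [(1, 1)] * 30 + [(0, 0)] * 50 + [(1, 0)] * 20                  # errors are all false negatives

def accuracy(pairs):
    """Aggregate accuracy: implicitly treats every failure mode as equivalent."""
    return sum(y == yhat for y, yhat in pairs) / len(pairs)

def failure_modes(pairs):
    """Disaggregate errors into false positives and false negatives."""
    counts = Counter()
    for y, yhat in pairs:
        if y != yhat:
            counts["false_positive" if yhat == 1 else "false_negative"] += 1
    return dict(counts)

for name, pairs in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: accuracy={accuracy(pairs):.2f} failures={failure_modes(pairs)}")

# Both models report 0.80 accuracy, yet model_b's 20 errors are all missed
# positives. Whether that asymmetry matters depends on the deployment
# context, which the aggregate metric abstracts away.
```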