Paper Title
Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines
Paper Authors
Paper Abstract
Developing modern machine learning (ML) applications is data-centric, and one fundamental challenge is to understand the influence of data quality on ML training -- "Which training examples are 'guilty' of making the trained ML model's predictions inaccurate or unfair?" Modeling data influence for ML training has attracted intensive interest over the last decade, and one popular framework is to compute the Shapley value of each training example with respect to utilities such as validation accuracy and fairness of the trained ML model. Unfortunately, despite this recent intensive interest and research, existing methods only consider a single ML model "in isolation" and do not account for an end-to-end ML pipeline that consists of data transformations, feature extractors, and ML training. We present DataScope (ease.ml/datascope), the first system that efficiently computes Shapley values of training examples over an end-to-end ML pipeline, and illustrate its applications in data debugging for ML training. To this end, we first develop a novel algorithmic framework that computes Shapley values over a specific family of ML pipelines that we call canonical pipelines: a positive relational algebra query followed by a K-nearest-neighbor (KNN) classifier. We show that, for many subfamilies of canonical pipelines, computing Shapley values is in PTIME, in contrast to the exponential complexity of computing Shapley values in general. We then put this to practice -- given an sklearn pipeline, we approximate it with a canonical pipeline to use as a proxy. We conduct extensive experiments illustrating different use cases and utilities. Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods, while being comparably, and often even more, effective in data debugging.
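To make the Shapley-value framing concrete, the sketch below computes exact Shapley values of training examples with respect to the validation accuracy of a 1-NN classifier, by brute-force enumeration of all subsets. This is a toy illustration of the general (exponential-time) definition that the abstract contrasts with, not DataScope's PTIME algorithm; the tiny dataset and the zero-utility convention for the empty set are made-up assumptions for demonstration.

```python
from itertools import combinations
from math import comb

# Hypothetical toy data: (feature, label) pairs; not from the paper.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (1.2, 1)]   # training examples
valid = [(0.1, 0), (1.1, 1)]                        # validation set

def utility(subset_idx):
    """Validation accuracy of a 1-NN classifier fit on the given subset."""
    if not subset_idx:
        return 0.0  # convention: an empty training set has zero utility
    correct = 0
    for xv, yv in valid:
        nearest = min(subset_idx, key=lambda i: abs(train[i][0] - xv))
        correct += (train[nearest][1] == yv)
    return correct / len(valid)

def shapley(n):
    """Exact Shapley value of each training example: enumerates O(2^n) subsets."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k (of the n-1 other players).
            w = 1.0 / (n * comb(n - 1, k))
            for S in combinations(others, k):
                # Marginal contribution of example i to coalition S.
                values[i] += w * (utility(S + (i,)) - utility(S))
    return values

phi = shapley(len(train))
```

By the efficiency property of the Shapley value, `sum(phi)` equals the utility of the full training set minus that of the empty set. The exponential subset enumeration is exactly why a PTIME result for KNN-based canonical pipelines matters.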