Title

Abstract Interpretation-Based Data Leakage Static Analysis

Authors

Drobnjaković, Filip, Subotić, Pavle, Urban, Caterina

Abstract


Data leakage is a well-known problem in machine learning. It occurs when information from outside the training dataset is used to create a model. This phenomenon renders a model excessively optimistic, or even useless in the real world, since the model tends to rely heavily on the unfairly acquired information. To date, detection of data leakage has occurred post-mortem, using run-time methods. However, due to its insidious nature, it may not be apparent to a data scientist that data leakage has occurred in the first place. For this reason, it is advantageous to detect data leakage as early as possible in the development life cycle. In this paper, we propose a novel static analysis that detects several instances of data leakage at development time. We define our analysis using the framework of abstract interpretation: we define a concrete semantics that is sound and complete, from which we derive a sound and computable abstract semantics. We implement our static analysis inside the open-source NBLyzer static analysis framework and demonstrate its utility by evaluating its performance and precision on over 2000 Kaggle competition notebooks.
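The kind of leakage described in the abstract can be illustrated with a minimal, hypothetical sketch (the data values and variable names below are invented for illustration and do not come from the paper): normalization statistics computed over the full dataset, before the train/test split, let the held-out test point influence the training features.

```python
# Hypothetical illustration of train/test data leakage via preprocessing.
# The last element of `data` is the held-out test point.

def mean(xs):
    """Arithmetic mean of a list of floats."""
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky: the centering statistic is computed on the FULL dataset,
# so the test point's (outlier) value leaks into the training features.
leaky_mu = mean(data)

# Leak-free: the statistic is computed on the training split only.
safe_mu = mean(train)

leaky_train = [x - leaky_mu for x in train]   # shifted by test-set information
safe_train = [x - safe_mu for x in train]     # depends only on training data

print(leaky_mu)  # 22.0 -- dominated by the unseen test point
print(safe_mu)   # 2.5
```

A static analysis of the kind the paper proposes would aim to flag the first pattern at development time, before the model is ever trained or evaluated.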
