Paper title
Benchmarking Unsupervised Outlier Detection with Realistic Synthetic Data
Paper authors
Paper abstract
Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose a generic process for generating data sets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. This allows both for good coverage of domains and for helpful interpretations of results. We also describe three instantiations of the generic process that generate outliers with specific characteristics, such as local outliers. A benchmark with state-of-the-art detection methods confirms that our generic process is indeed practical.
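The core idea in the abstract (fit a model to real regular instances, sample synthetic regular instances from it, and generate outliers with a known characteristic) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the independent per-feature Gaussian model, the function names, and the rule that an outlier deviates by `min_z` to `max_z` standard deviations in one randomly chosen feature.

```python
import random
import statistics


def fit_regular_model(real_rows):
    """Estimate a per-feature mean and standard deviation from real data.

    Assumption: features are modeled as independent Gaussians. Needs at
    least two rows, because statistics.stdev requires two data points.
    """
    columns = list(zip(*real_rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def sample_regular(model, n, rng):
    """Draw n synthetic regular instances from the fitted model."""
    return [[rng.gauss(mu, sigma) for mu, sigma in model] for _ in range(n)]


def sample_outliers(model, n, rng, min_z=3.0, max_z=6.0):
    """Draw n outliers that deviate strongly in one random feature.

    Hypothetical outlier definition: one feature is placed between min_z
    and max_z standard deviations away from its mean, on a random side.
    A constant feature (sigma == 0) cannot be shifted this way.
    """
    outliers = []
    for _ in range(n):
        point = [rng.gauss(mu, sigma) for mu, sigma in model]
        j = rng.randrange(len(model))
        mu, sigma = model[j]
        z = rng.uniform(min_z, max_z) * rng.choice((-1.0, 1.0))
        point[j] = mu + z * sigma
        outliers.append(point)
    return outliers


def make_benchmark(real_rows, n_regular, n_outliers, seed=0):
    """Return (data, labels), where label 1 marks a generated outlier."""
    rng = random.Random(seed)
    model = fit_regular_model(real_rows)
    data = sample_regular(model, n_regular, rng)
    data += sample_outliers(model, n_outliers, rng)
    labels = [0] * n_regular + [1] * n_outliers
    return data, labels
```

Calling `make_benchmark(real_rows, n_regular=950, n_outliers=50)` on a list of numeric rows would yield a labeled data set in which the characteristic of every outlier is known by construction, which is what makes detection results interpretable. A realistic instantiation would likely use a richer model of the regular instances than independent Gaussians.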