Paper Title
D.MCA: Outlier Detection with Explicit Micro-Cluster Assignments
Paper Authors
Paper Abstract
How can we detect outliers, both scattered and clustered, and also explicitly assign them to their respective micro-clusters, without knowing a priori how many micro-clusters exist? How can we perform both tasks in-house, i.e., without any post-hoc processing, so that detection and assignment can benefit from each other simultaneously? Presenting outliers in separate micro-clusters is informative to analysts in many real-world applications. However, a naïve solution based on post-hoc clustering of the outliers detected by any existing method suffers from two main drawbacks: (a) appropriate hyperparameter values are commonly unknown for clustering, and most algorithms struggle with clusters of varying shapes and densities; (b) detection and assignment cannot benefit from one another. In this paper, we propose D.MCA to $\underline{D}$etect outliers with explicit $\underline{M}$icro-$\underline{C}$luster $\underline{A}$ssignment. Our method performs both detection and assignment iteratively, and in-house, by using a novel strategy that prunes entire micro-clusters out of the training set to improve detection performance. It also benefits from a novel strategy that prevents clustered outliers from masking each other, which is a well-known problem in the literature. In addition, D.MCA is designed to be robust to a critical hyperparameter by employing a hyperensemble "warm up" phase. Experiments performed on 16 real-world and synthetic datasets demonstrate that D.MCA outperforms 8 state-of-the-art competitors, especially on the explicit outlier micro-cluster assignment task.
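To make the naïve baseline criticized in the abstract concrete, the following is a minimal sketch of a detect-then-cluster pipeline. It is not D.MCA and not taken from the paper: the k-nearest-neighbor distance detector, the distance-threshold grouping, the toy data, and all names and parameter values (`k`, `score_threshold`, `eps`) are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def knn_distance_scores(points: Sequence[Point], k: int) -> List[float]:
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def group_by_linkage(points: Sequence[Point], indices: Sequence[int],
                     eps: float) -> List[List[int]]:
    """Post-hoc step: connected components, linking points within eps."""
    unvisited = set(indices)
    groups = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        stack, group = [seed], [seed]
        while stack:
            cur = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[cur], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
                group.append(j)
        groups.append(sorted(group))
    return groups


def naive_posthoc_pipeline(points: Sequence[Point], k: int,
                           score_threshold: float,
                           eps: float) -> List[List[int]]:
    """Detect first, then cluster the detected outliers (hypothetical baseline)."""
    scores = knn_distance_scores(points, k)
    outliers = [i for i, s in enumerate(scores) if s > score_threshold]
    return group_by_linkage(points, outliers, eps)


# Toy data (assumed): a 5x5 grid of inliers (indices 0..24), a micro-cluster
# of three outliers (indices 25..27), and one scattered outlier (index 28).
points: List[Point] = [(float(x), float(y)) for x in range(5) for y in range(5)]
points += [(20.0, 20.0), (20.5, 20.0), (20.0, 20.5)]
points += [(-15.0, 10.0)]

groups_k3 = naive_posthoc_pipeline(points, k=3, score_threshold=5.0, eps=1.0)
groups_k2 = naive_posthoc_pipeline(points, k=2, score_threshold=5.0, eps=1.0)
print(groups_k3)  # [[25, 26, 27], [28]]
print(groups_k2)  # [[28]]
```

On this toy data, `k=3` recovers both the micro-cluster and the scattered outlier, while `k=2` misses the micro-cluster entirely because its three members serve as each other's nearest neighbors. This illustrates the masking effect and the hyperparameter sensitivity that the abstract attributes to post-hoc solutions; the grouping step never feeds information back to the detector.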