Paper title
TVOR: Finding Discrete Total Variation Outliers among Histograms
Authors
Abstract
Pearson's chi-squared test can detect outliers in the data distribution of a given set of histograms. However, in fields such as demographics (e.g. birth years), outliers may be more easily found in terms of histogram smoothness, where techniques such as Whipple's or Myers' indices successfully handle only specific anomalies. This paper proposes detecting smoothness outliers among histograms by using the relation between their discrete total variations (DTV) and their respective sample sizes. This relation is derived mathematically so that it applies in all cases, and it is then simplified by an accurate linear model. The deviation of a histogram's DTV from the value predicted by the model is used as the outlier score, and the proposed method is named Total Variation Outlier Recognizer (TVOR). TVOR requires no prior assumptions about the distribution of the histograms' samples, has no hyperparameters that require tuning, is not limited to specific patterns, and is applicable to any set of histograms that share the same bins. Each bin can have an arbitrary interval, which may also be unbounded. TVOR finds DTV outliers more easily than Pearson's chi-squared test; for distribution outliers, the opposite holds. TVOR is tested on real census data, where it successfully finds suspicious histograms. The source code is available at https://github.com/DiscreteTotalVariation/TVOR.
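The scoring idea in the abstract (compute each histogram's DTV, fit a linear model against sample size, and use the deviation from the fit as the outlier score) can be illustrated with a rough sketch. This is not the authors' implementation, which is in the linked repository. The abstract does not state the exact form of the linear model, so the sketch makes two assumptions of its own: DTV is computed on the normalized histogram, and it is regressed on 1/sqrt(n), where n is the sample size. The function names `discrete_total_variation` and `tvor_like_scores` are hypothetical.

```python
import numpy as np


def discrete_total_variation(hist):
    """Sum of absolute differences between neighbouring bins.

    Assumption: the histogram is normalized to relative frequencies first,
    so that histograms with different sample sizes are comparable.
    """
    h = np.asarray(hist, dtype=float)
    p = h / h.sum()
    return np.abs(np.diff(p)).sum()


def tvor_like_scores(histograms):
    """Score each histogram by how far its DTV lies above a linear fit.

    Assumption: DTV is modelled as a linear function of 1/sqrt(n); the
    paper's actual model may differ. All histograms must share the same bins.
    """
    H = np.asarray(histograms, dtype=float)
    n = H.sum(axis=1)                       # sample size of each histogram
    dtv = np.array([discrete_total_variation(h) for h in H])
    x = 1.0 / np.sqrt(n)                    # assumed regressor
    slope, intercept = np.polyfit(x, dtv, 1)
    residuals = dtv - (slope * x + intercept)
    scale = residuals.std()
    # Standardize so that scores are comparable across data sets.
    return residuals / scale if scale > 0 else residuals
```

Under these assumptions, a histogram with a large positive score is less smooth than its sample size would suggest, which is the pattern produced by effects such as age or year heaping. An ordinary least-squares fit is itself pulled towards outliers, so a robust regression may be preferable in practice.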