Paper Title

Variational Inference for Semiparametric Bayesian Novelty Detection in Large Datasets

Authors

Luca Benedetti, Eric Boniardi, Leonardo Chiani, Jacopo Ghirri, Marta Mastropietro, Andrea Cappozzo, Francesco Denti

Abstract

After being trained on a fully labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging a model-based mixture representation, Brand clusters the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand's applicability to large datasets, we propose a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
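The mixture structure described in the abstract can be sketched in symbols. This is only an illustrative reading, not the paper's own notation: the Gaussian kernel $\phi$, the weights $\pi_j$, the concentration parameter $\gamma$, the base measure $H$, and the stick-breaking form are all assumptions introduced here.

```latex
% Assumed notation: y is a test observation, J is the number of known classes,
% \pi_0 is the weight of the single novelty term.
\begin{align*}
  f(y) &= \pi_0\, f_{\mathrm{nov}}(y)
          + \sum_{j=1}^{J} \pi_j\, \phi(y \mid \mu_j, \Sigma_j),
          \qquad \sum_{j=0}^{J} \pi_j = 1, \\
  % Novelty term: a Dirichlet Process mixture, written via stick-breaking.
  f_{\mathrm{nov}}(y) &= \int \phi(y \mid \theta)\, G(\mathrm{d}\theta)
          = \sum_{k=1}^{\infty} \omega_k\, \phi(y \mid \theta_k),
          \qquad G \sim \mathrm{DP}(\gamma, H), \\
  \omega_k &= v_k \prod_{l<k} (1 - v_l),
          \qquad v_k \sim \mathrm{Beta}(1, \gamma),
          \qquad \theta_k \sim H, \\
  % Variational Bayes replaces MCMC sampling with optimization of the ELBO
  % over a tractable family \mathcal{Q} (z denotes all latent quantities).
  q^{\star} &= \arg\max_{q \in \mathcal{Q}}\;
          \mathbb{E}_{q}\!\left[\log p(y, z)\right]
          - \mathbb{E}_{q}\!\left[\log q(z)\right].
\end{align*}
```

Under this reading, the first $J$ components would be informed by the labeled training set, while the novelty term absorbs any test observations that depart from the known classes.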
