Paper title
Nonstationary data stream classification with online active learning and siamese neural networks
Paper authors
Paper abstract
In recent years, an ever-growing volume of information has become available as streams in various application areas. As a result, there is an emerging need for online learning methods that train predictive models on-the-fly. A series of open challenges, however, hinders their deployment in practice: learning as data arrive one-by-one in real time, learning from data with limited ground truth information, learning from nonstationary data, and learning from severely imbalanced data, all while occupying a limited amount of memory for data storage. We propose the ActiSiamese algorithm, which addresses these challenges by combining online active learning, siamese networks, and a multi-queue memory. It introduces a new density-based active learning strategy that considers similarity in the latent (rather than the input) space. We conduct an extensive study that compares the role of different active learning budgets and strategies, the performance with and without memory, and the performance with and without ensembling, on both synthetic and real-world datasets, under different data nonstationarity characteristics and class imbalance levels. ActiSiamese outperforms baseline and state-of-the-art algorithms, and is effective under severe imbalance, even when only a fraction of the arriving instances' labels is available. We publicly release our code to the community.
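To make two of the ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a multi-queue memory can be modeled as one bounded FIFO queue per class, and that the labeling budget is the maximum fraction of arriving instances whose labels may be requested. The class names `MultiQueueMemory` and `BudgetedQuerier`, the `threshold` parameter, and the confidence-based query rule are all hypothetical; in particular, the confidence threshold is a simple stand-in for the density-based latent-space strategy described in the abstract.

```python
from collections import deque


class MultiQueueMemory:
    """Hypothetical memory: one bounded FIFO queue per class, so that
    instances of a rare class are never evicted by a frequent class."""

    def __init__(self, classes, capacity_per_class):
        self.queues = {c: deque(maxlen=capacity_per_class) for c in classes}

    def add(self, x, label):
        # Appending to a full deque silently drops its oldest element.
        self.queues[label].append(x)

    def size(self):
        return sum(len(q) for q in self.queues.values())


class BudgetedQuerier:
    """Hypothetical query rule: request a label only when the model is
    unsure and the fraction of queried instances is below the budget."""

    def __init__(self, budget):
        self.budget = budget  # assumed: fraction of instances we may label
        self.seen = 0
        self.queried = 0

    def should_query(self, confidence, threshold=0.9):
        self.seen += 1
        within_budget = self.queried / self.seen < self.budget
        if within_budget and confidence < threshold:
            self.queried += 1
            return True
        return False


# Usage: a stream where class 0 dominates and every prediction is uncertain.
memory = MultiQueueMemory(classes=[0, 1], capacity_per_class=2)
for i in range(5):
    memory.add(i, label=0)
memory.add("rare", label=1)

querier = BudgetedQuerier(budget=0.1)
decisions = [querier.should_query(confidence=0.5) for _ in range(100)]
```

Bounding each class separately is one simple way to keep memory use fixed while still retaining minority-class examples under severe imbalance; the budget check keeps the labeling cost proportional to the stream length.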