Paper Title
Sampling Streaming Data with Parallel Vector Quantization -- PVQ
Authors
Abstract
The accumulation of corporate data in the cloud has attracted more enterprise applications to the cloud, creating data gravity. As a consequence, network traffic has become more cloud-centric. This increase in cloud-centric traffic poses new challenges in designing learning systems for streaming data because of class imbalance. The number of classes plays a vital role in the accuracy of classifiers built from data streams. In this paper, we present a vector quantization-based sampling method that substantially reduces the class imbalance in data streams. We demonstrate its effectiveness through experiments on network traffic and anomaly datasets with commonly used ML model-building methods: a multilayer perceptron on a TensorFlow backend, support vector machines, k-nearest neighbours, and random forests. We built models using parallel processing, batch processing, and random sample selection. We show that the accuracy of the classification models improves when the data streams are pre-processed with our method. We evaluated these classifiers with their out-of-the-box hyperparameters and with auto-sklearn for hyperparameter optimization.
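The abstract does not spell out the PVQ algorithm, so the following is only a minimal sketch of the general idea of vector-quantization-based sampling, not the authors' implementation. It assumes plain Lloyd's (k-means style) quantization as the quantizer and replaces each over-represented class with a small codebook of centroids. The function names `build_codebook` and `vq_sample`, the codebook size `k`, and the synthetic data are all hypothetical choices made for illustration; the parallel and batch processing mentioned in the abstract are not modelled here.

```python
import numpy as np

def build_codebook(points, k, iters=20, seed=0):
    """Lloyd's algorithm: learn up to k codewords (centroids) for the points."""
    rng = np.random.default_rng(seed)
    k = min(k, len(points))
    # Initialise the codebook with k distinct points drawn at random.
    codebook = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest codeword.
        dists = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Move each codeword to the mean of the points assigned to it.
        for j in range(k):
            members = points[nearest == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def vq_sample(X, y, k):
    """Replace every class that has more than k samples by k codewords."""
    Xs, ys = [], []
    for label in np.unique(y):
        pts = X[y == label]
        if len(pts) > k:
            pts = build_codebook(pts, k)  # assumption: one codebook per class
        Xs.append(pts)
        ys.append(np.full(len(pts), label))
    return np.vstack(Xs), np.concatenate(ys)

# Synthetic imbalanced data (illustrative only): 1000 vs. 50 samples.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (1000, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 1000 + [1] * 50)

X_bal, y_bal = vq_sample(X, y, k=50)
print(np.bincount(y), np.bincount(y_bal))
```

In this sketch the majority class is compressed to 50 representative vectors while the minority class is left untouched, so a downstream classifier would be trained on equally sized classes. Whether the paper quantizes per class, per batch, or over the whole stream is not stated in the abstract.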