Paper Title

Evaluating k-NN in the Classification of Data Streams with Concept Drift

Authors

de Barros, Roberto Souto Maior; Santos, Silas Garrido Teixeira de Carvalho; Barddal, Jean Paul

Abstract

Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover, these data are likely subject to changes in data distribution, known as concept drift. For these reasons, learning from streams is usually performed online, under restrictions on memory consumption and run-time. Although many classification algorithms exist, most works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subject to concept drift. It also analyses the time complexity and the two main parameters of k-NN, i.e., the number of nearest neighbors used for predictions (k) and the window size (w). We compare different parameter values for k-NN and contrast it with NB and HT, both with and without a drift detector (RDDM), on many datasets. We formulated and answered 10 research questions, which led to the conclusion that k-NN is a worthy candidate for data stream classification, especially when the run-time constraint is not too restrictive.
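To make the roles of k and w concrete, the following is a minimal Python sketch of a sliding-window k-NN classifier evaluated in a test-then-train loop. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the class name `SlidingWindowKNN`, the helper `prequential_accuracy`, the Euclidean distance, and the majority vote are all choices made here for the example, and no drift detector such as RDDM is included.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Hypothetical sliding-window k-NN; names and defaults are assumptions."""

    def __init__(self, k=3, w=100):
        self.k = k
        # A deque with maxlen=w discards the oldest instance automatically,
        # so the model only remembers the w most recent labelled instances.
        self.window = deque(maxlen=w)

    def predict(self, x):
        if not self.window:
            return None  # nothing stored yet, so no prediction is possible
        # Scan the whole window and keep the k closest stored instances.
        nearest = sorted(self.window, key=lambda item: math.dist(x, item[0]))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]

    def learn(self, x, y):
        self.window.append((x, y))


def prequential_accuracy(model, stream):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        prediction = model.predict(x)
        if prediction is not None:
            total += 1
            correct += int(prediction == y)
        model.learn(x, y)
    return correct / total if total else 0.0
```

In this sketch every prediction scans all stored instances, so its cost grows with w, while a smaller w forgets an outdated concept sooner. That is one way to see why run-time and window size interact in the trade-off the abstract mentions.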
