论文标题
基于MDL的压缩顺序规则
MDL-based Compressing Sequential Rules
论文作者
论文摘要
如今,随着互联网的快速发展,大数据的时代已经到来。互联网每天都会产生大量数据。但是,从大量数据中提取有意义的信息就像在干草堆中寻找针头一样。数据挖掘技术可以提供各种可行的方法来解决此问题。目前,介绍了许多顺序规则挖掘(SRM)算法,以在具有顺序特征的数据库中找到顺序规则。这些规则可帮助人们从大量数据中提取许多有意义的信息。我们如何才能实现挖掘结果的压缩并减少数据大小以节省存储空间和传输时间?到目前为止,关于SRM压缩的研究很少。在本文中,结合最小描述长度(MDL)原理以及在两个指标(支持和置信度)下,我们介绍了SRM压缩的问题,还提出了一种名为COMSR的解决方案,用于基于MDL的基于MDL的压缩基于设计的顺序规则编码方案。据我们所知,我们是第一个使用顺序规则编码整个数据库的人。提出了一种启发式方法,以尽可能地找到一组紧凑而有意义的顺序规则。 COMSR根据数据库是否可以完全压缩,具有两种权衡算法,即comsr_non和comsr_ful。在具有不同阈值的真实数据集上进行的实验表明,可以找到一组紧凑而有意义的顺序规则。这表明所提出的方法有效。
Nowadays, with the rapid development of the Internet, the era of big data has come. The Internet generates huge amounts of data every day. However, extracting meaningful information from massive data is like looking for a needle in a haystack. Data mining techniques can provide various feasible methods to solve this problem. At present, many sequential rule mining (SRM) algorithms are presented to find sequential rules in databases with sequential characteristics. These rules help people extract a lot of meaningful information from massive amounts of data. How can we achieve compression of mined results and reduce data size to save storage space and transmission time? Until now, there has been little research on the compression of SRM. In this paper, combined with the Minimum Description Length (MDL) principle and under the two metrics (support and confidence), we introduce the problem of compression of SRM and also propose a solution named ComSR for MDL-based compressing of sequential rules based on the designed sequential rule coding scheme. To our knowledge, we are the first to use sequential rules to encode an entire database. A heuristic method is proposed to find a set of compact and meaningful sequential rules as much as possible. ComSR has two trade-off algorithms, ComSR_non and ComSR_ful, based on whether the database can be completely compressed. Experiments done on a real dataset with different thresholds show that a set of compact and meaningful sequential rules can be found. This shows that the proposed method works.