Identifying meaningful clusters in malware data

Paper title

Identifying meaningful clusters in malware data

Authors

de Amorim, Renato Cordeiro; Ruiz, Carlos David Lopez

Abstract

Finding meaningful clusters in drive-by-download malware data is a particularly difficult task. Malware data tends to contain overlapping clusters whose cardinalities vary widely. This happens because malware samples can be considerably similar to one another (some are even said to belong to the same family), and these samples tend to appear in bursts. Clustering algorithms are usually applied to normalised data sets. However, normalisation only aims to give features with different value ranges a similar contribution to the clustering. It does not favour more meaningful features over less meaningful ones, an effect one should perhaps expect of the data pre-processing stage. In this paper we introduce a method that deals precisely with this problem. It is an iterative data pre-processing method that helps increase the separation between clusters. It does so by calculating the within-cluster degree of relevance of each feature and then using these values as data rescaling factors. By repeating this until convergence, our malware data was separated into clear clusters, leading to a higher average silhouette width.
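
The loop described in the abstract (cluster, score each feature's relevance, rescale, repeat until convergence) can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how the degree of relevance is computed, so the sketch assumes a simple stand-in, the normalised inverse of each feature's within-cluster dispersion, with one global weight per feature. It also assumes the input is already normalised, and it recomputes the weights from the unscaled data in every round so that the rescaling does not compound. The names `kmeans`, `feature_weights`, `iterative_rescale` and `average_silhouette` are hypothetical helpers written for this illustration.

```python
import math


def _sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(data, k, max_iter=100):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centroids = [list(data[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(list(max(
            data, key=lambda p: min(_sq_dist(p, c) for c in centroids))))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
                      for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def feature_weights(data, labels, k, eps=1e-12):
    """ASSUMPTION: relevance = normalised inverse within-cluster dispersion."""
    n_features = len(data[0])
    dispersion = [0.0] * n_features
    for c in range(k):
        members = [p for p, lab in zip(data, labels) if lab == c]
        if not members:
            continue
        centre = [sum(col) / len(members) for col in zip(*members)]
        for p in members:
            for v in range(n_features):
                dispersion[v] += (p[v] - centre[v]) ** 2
    inverse = [1.0 / (d + eps) for d in dispersion]
    total = sum(inverse)
    return [w / total for w in inverse]  # weights sum to 1


def iterative_rescale(data, k, max_rounds=20):
    """Cluster, reweight features, rescale; stop when the labels stop changing."""
    n_features = len(data[0])
    weights = [1.0 / n_features] * n_features
    labels = None
    for _ in range(max_rounds):
        scaled = [[w * x for w, x in zip(weights, p)] for p in data]
        new_labels = kmeans(scaled, k)
        if new_labels == labels:
            break
        labels = new_labels
        weights = feature_weights(data, labels, k)
    scaled = [[w * x for w, x in zip(weights, p)] for p in data]
    return scaled, weights, labels


def average_silhouette(data, labels):
    """Mean silhouette width; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(data):
        mean_dist = {}
        for c in set(labels):
            dists = [math.sqrt(_sq_dist(p, q)) for j, q in enumerate(data)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        a = mean_dist.get(labels[i])
        rivals = [d for c, d in mean_dist.items() if c != labels[i]]
        if a is None or not rivals or max(a, min(rivals)) == 0:
            scores.append(0.0)
            continue
        b = min(rivals)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (made up for illustration): feature 0 separates two groups,
# feature 1 is spread equally widely inside both groups.
samples = [[0.0, 0.0], [0.5, 5.0], [1.0, 10.0], [0.2, 2.0], [0.8, 8.0],
           [10.0, 0.0], [9.5, 5.0], [9.0, 10.0], [9.8, 2.0], [9.2, 8.0]]
rescaled, weights, labels = iterative_rescale(samples, k=2)
before = average_silhouette(samples, labels)
after = average_silhouette(rescaled, labels)
```

On this toy data the weight of the separating feature comes out near 0.99 and `after` exceeds `before`, which mirrors the effect the abstract reports, although under a weighting scheme chosen here for simplicity.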
