Paper Title

Combining AI and AM - Improving Approximate Matching through Transformer Networks

Paper Authors

Frieder Uhlig, Lukas Struppek, Dominik Hintersdorf, Thomas Göbel, Harald Baier, Kristian Kersting

Paper Abstract

Approximate matching (AM) is a concept in digital forensics to determine the similarity between digital artifacts. An important use case of AM is the reliable and efficient detection of case-relevant data structures on a blacklist, if only fragments of the original are available. For instance, if only a cluster of indexed malware is still present during a digital forensic investigation, the AM algorithm shall be able to assign the fragment to the blacklisted malware. However, traditional AM functions like TLSH and ssdeep fail to detect files based on their fragments if the presented piece is relatively small compared to the overall file size. A second well-known issue with traditional AM algorithms is the lack of scaling due to ever-increasing lookup databases. We propose an improved matching algorithm based on transformer models from the field of natural language processing. We call our approach Deep Learning Approximate Matching (DLAM). As a concept from artificial intelligence (AI), DLAM acquires knowledge of characteristic blacklisted patterns during its training phase. DLAM is then able to detect these patterns in a typically much larger file; that is, DLAM focuses on the use case of fragment detection. We reveal that DLAM has three key advantages compared to the prominent conventional approaches TLSH and ssdeep. First, it makes the tedious extraction of known-to-be-bad parts obsolete, which until now has been necessary before any search for them with AM algorithms. This allows efficient classification of files on a much larger scale, which is important given the exponentially increasing amount of data to be investigated. Second, depending on the use case, DLAM achieves similar or even significantly higher accuracy in recovering fragments of blacklisted files. Third, we show that DLAM enables the detection of file correlations in the output of TLSH and ssdeep, even for small fragment sizes.
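To make the idea concrete, the following is a minimal, hypothetical sketch of the kind of pipeline the abstract describes: a small transformer encoder that classifies fuzzy-hash digests (e.g. ssdeep or TLSH output, treated as character sequences) as containing a blacklisted fragment or not. It is not the authors' implementation; the model size, character-level tokenization, class layout, and all names are assumptions for illustration only.

```python
# Hypothetical sketch (not the paper's code): transformer encoder over
# digest strings for "blacklisted fragment present / absent" classification.
import torch
import torch.nn as nn

class DigestClassifier(nn.Module):
    def __init__(self, vocab_size=128, d_model=64, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned positional embeddings for sequences up to max_len tokens.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # 2 classes: fragment present / absent

    def forward(self, tokens):
        # tokens: (batch, seq_len) of ASCII codepoints
        x = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # mean-pool over the sequence, then classify

def encode_digest(digest: str, max_len: int = 128) -> torch.Tensor:
    """Map a digest string to a fixed-length tensor of ASCII codepoints (0-padded)."""
    codes = [min(ord(c), 127) for c in digest[:max_len]]
    codes += [0] * (max_len - len(codes))
    return torch.tensor(codes).unsqueeze(0)

# Usage with a made-up ssdeep-style digest; an untrained model yields
# meaningless probabilities until fitted on labeled blacklisted/benign digests.
model = DigestClassifier()
logits = model(encode_digest("3072:aBcDeFgHiJkLmNoPqRsTuVwXyZ:aBcDeF"))
print(logits.softmax(dim=-1))
```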
