BinMLM: Binary Authorship Verification with Flow-aware Mixture-of-Shared Language Model

Paper Title

BinMLM: Binary Authorship Verification with Flow-aware Mixture-of-Shared Language Model

Authors

Song, Qige; Zhang, Yongzheng; Ouyang, Linshu; Chen, Yige

Abstract

Binary authorship analysis is a significant problem in many software engineering applications. In this paper, we formulate a binary authorship verification task that reflects the real-world working process of software forensic experts. The task is to determine whether an anonymous binary was developed by a specific programmer, given a small set of support samples from that programmer; the actual developer may not belong to the known candidate set and may instead come from the wild. We propose an effective binary authorship verification framework, BinMLM. BinMLM trains an RNN language model on consecutive opcode traces extracted from the control flow graph (CFG) to characterize the candidate developers' programming styles. We build a mixture-of-shared architecture with multiple shared encoders and author-specific gate layers, which learns each developer's preferred combination of universal programming patterns and alleviates the problem of scarce training data. Through an optimization pipeline of external pre-training, joint training, and fine-tuning, our framework can eliminate additional noise and accurately distill developers' unique styles. Extensive experiments show that BinMLM achieves promising results on the Google Code Jam (GCJ) and Codeforces datasets with different numbers of programmers and support samples. It significantly outperforms baselines built on the state-of-the-art feature set (4.73% to 19.46% improvement) and remains robust in multi-author collaboration scenarios. Furthermore, BinMLM can perform organization-level verification on a real-world APT malware dataset, providing valuable auxiliary information for identifying the group behind an APT attack.
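
The abstract does not give the exact form of the author-specific gate layers. As a rough illustration only, the sketch below assumes that each author owns a vector of gate logits, that a softmax turns those logits into mixing weights, and that the weights form a convex combination of the outputs of K shared encoders. The function names, the toy encoder outputs, and the gate values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_shared_outputs(shared_outputs, gate_logits):
    """Combine K shared-encoder vectors using one author's gate weights.

    Assumption: the gate is a softmax over K logits, so the result is a
    convex combination of the shared encoder outputs.
    """
    weights = softmax(gate_logits)
    dim = len(shared_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, shared_outputs))
        for i in range(dim)
    ]


# Hypothetical outputs of K = 3 shared encoders for one opcode position.
shared_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Hypothetical per-author gate logits: each author favors a different encoder.
author_gates = {
    "author_a": [2.0, 0.0, 0.0],
    "author_b": [0.0, 2.0, 0.0],
}

mixed = {
    name: mix_shared_outputs(shared_outputs, logits)
    for name, logits in author_gates.items()
}
```

Under these assumptions the encoders are shared by all authors and only the small gate vector is author-specific, which is one way an architecture of this kind could cope with having few samples per author.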
