Paper Title

Cross-Language Binary-Source Code Matching with Intermediate Representations

Authors

Yi Gui, Yao Wan, Hongyu Zhang, Huifang Huang, Yulei Sui, Guandong Xu, Zhiyuan Shao, Hai Jin

Abstract

Binary-source code matching plays an important role in many security and software engineering related tasks, such as malware detection, reverse engineering, and vulnerability assessment. Currently, several approaches have been proposed for binary-source code matching that jointly learn the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target matching binary code and source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater to different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for this new problem. We present XLIR, a novel Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, we conduct comprehensive experiments on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both tasks.
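To illustrate the matching setup described in the abstract (this is a minimal sketch, not the paper's actual XLIR model): once binary code and source code have been encoded into a common vector space, matching reduces to nearest-neighbor retrieval, e.g. by cosine similarity. The embedding vectors and file names below are toy stand-ins for learned IR representations.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_candidates(query_vec, candidates):
    # Rank candidate source embeddings by similarity to a binary embedding.
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy embeddings standing in for vectors produced by a learned encoder.
binary_emb = [0.9, 0.1, 0.2]
source_embs = {
    "memcpy.c": [0.8, 0.2, 0.1],
    "quicksort.java": [0.1, 0.9, 0.3],
}
print(rank_candidates(binary_emb, source_embs)[0][0])  # prints "memcpy.c"
```

In a real pipeline the vectors would come from the trained encoders, and retrieval over large corpora would use an approximate nearest-neighbor index rather than a linear scan.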
