Paper Title
CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training
Paper Authors
Paper Abstract
Recent years have witnessed increasing interest in code representation learning, which aims to encode the semantics of source code into distributed vectors. Various works have been proposed to represent the complex semantics of source code from different views, including plain text, the Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., the Control/Data Flow Graph). However, most of them consider only a single view of source code in isolation, ignoring the correspondences among different views. In this paper, we propose to integrate different views of source code, together with its natural-language description, into a unified framework with Multi-View contrastive Pre-training, and name our model CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and learn the complementary information among them under a contrastive learning framework. Inspired by type checking in compilation, we also design a fine-grained type inference objective for pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP when compared with several state-of-the-art baselines. For example, we achieve gains of 2.4/2.3/1.1 in terms of the MRR/MAP/Accuracy metrics on the natural language code retrieval, code similarity, and code defect detection tasks, respectively.
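The abstract describes extracting multiple views of the same snippet with compiler tooling. As a minimal sketch of the "multiple views" idea (not the authors' actual pipeline, which uses dedicated compiler tools and graph extractors), Python's standard `ast` module can pair a plain-text view with a linearized AST view; the `extract_views` helper below is a hypothetical name introduced for illustration:

```python
import ast

def extract_views(source: str) -> dict:
    """Produce two simple views of a Python snippet:
    the plain-text view and a serialized AST view.

    Illustrative only: CODE-MVP additionally extracts
    control/data-flow graph views, which this sketch omits."""
    tree = ast.parse(source)
    return {
        "text": source.strip(),          # plain-text view
        "ast": ast.dump(tree),           # linearized Abstract Syntax Tree view
    }

views = extract_views("def add(a, b):\n    return a + b\n")
print(views["ast"])  # contains a FunctionDef node for `add`
```

Each view of the same snippet would then be encoded separately, and a contrastive objective would pull the resulting vectors together while pushing apart vectors from unrelated snippets.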