What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning

Paper title

What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning

Authors

Patrick Keller, Laura Plein, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon

Abstract

Recent successes in training word embeddings for NLP tasks have encouraged a wave of research on representation learning for source code, which builds on similar NLP methods. The overall objective is then to produce code embeddings that capture as much of the program semantics as possible. State-of-the-art approaches invariably rely on a syntactic representation (i.e., raw lexical tokens, abstract syntax trees, or intermediate representation tokens) to generate embeddings, which are criticized in the literature as non-robust or non-generalizable. In this work, we investigate a novel embedding approach based on the intuition that source code has visual patterns of semantics. We further use these patterns to address the outstanding challenge of identifying semantic code clones. We propose the WYSIWIM ("What You See Is What It Means") approach, where visual representations of source code are fed into powerful pre-trained image classification neural networks from the field of computer vision to benefit from the practical advantages of transfer learning. We evaluate the proposed embedding approach on two variations of the task of semantic code clone identification: code clone detection (a binary classification problem) and code classification (a multi-class classification problem). We show with experiments on the BigCloneBench (Java) and Open Judge (C) datasets that, although simple, our WYSIWIM approach performs as effectively as state-of-the-art approaches such as ASTNN or TBCNN. We further explore the influence of different steps in our approach, such as the choice of visual representations or the classification algorithm, to eventually discuss the promises and limitations of this research direction.
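The first step described in the abstract, turning source code into a visual representation, can be illustrated with a minimal sketch. This is not the authors' renderer: the palette, the keyword set, and the names `render_code_to_grid` and `classify` are assumptions made for illustration, and the pre-trained image classification network that would consume the image is deliberately left out. The sketch only shows how code can become a small grid of RGB cells in which each character is coloured by its token category.

```python
import re

# Hypothetical colour palette (RGB); not taken from the paper.
PALETTE = {
    "keyword": (0, 0, 255),
    "number": (0, 128, 0),
    "identifier": (0, 0, 0),
    "symbol": (128, 0, 0),
    "blank": (255, 255, 255),
}

# Small illustrative keyword set shared by C and Java; not exhaustive.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "void"}

# Identifiers, digit runs, whitespace runs, or any single other character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\s+|.")


def classify(token):
    """Map a token to one of the palette categories."""
    if token.isspace():
        return "blank"
    if token in KEYWORDS:
        return "keyword"
    if token[0].isdigit():
        return "number"
    if token[0].isalpha() or token[0] == "_":
        return "identifier"
    return "symbol"


def render_code_to_grid(source, width=16, height=4):
    """Return a height x width grid of RGB tuples, one cell per character.

    Lines beyond `height` and characters beyond `width` are cropped;
    unused cells keep the blank colour.
    """
    grid = [[PALETTE["blank"]] * width for _ in range(height)]
    for row, line in enumerate(source.splitlines()[:height]):
        col = 0
        for token in TOKEN_RE.findall(line):
            colour = PALETTE[classify(token)]
            for _ in token:
                if col >= width:
                    break
                grid[row][col] = colour
                col += 1
    return grid


if __name__ == "__main__":
    grid = render_code_to_grid("int x = 42;\nreturn x;", width=12, height=2)
    for row in grid:
        print(row)
```

In this toy rendering, `int x = 42;` and `int y = 17;` produce identical grids, which hints at why a visual representation might be less sensitive to identifier names than raw tokens. In the approach the abstract describes, such an image would then be passed to a pre-trained image classification network to obtain an embedding; that step is not shown here.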
