Paper Title
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
Paper Authors
Paper Abstract
Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, and code migration. Current learning techniques, however, have a major drawback: the models are mostly trained on datasets labeled for particular downstream tasks, and the resulting code representations may not be suitable for other tasks. While some techniques do produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. This paper proposes InferCode to overcome this limitation by adapting the self-supervised learning mechanism to build source code models. The key novelty lies in training code representations by predicting subtrees automatically identified from the context of ASTs. Subtrees in ASTs are treated by InferCode as the labels for training code representations, without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We trained an InferCode model instance using a Tree-Based CNN as the encoder on a large set of Java code, applied it to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, and reused it under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, our pre-trained InferCode model achieves higher performance results, with a significant margin for most tasks, including those involving different programming languages.
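To make the pretraining objective concrete, the sketch below illustrates, under stated assumptions, how a pooled code vector could be trained to predict the subtrees observed in a snippet's AST. It is a minimal sketch in PyTorch, not the authors' implementation: the class name InferCodeSketch, the simplified encoder (node-type embeddings with one linear layer and mean pooling standing in for the paper's Tree-Based CNN), the vocabulary sizes, and the random toy inputs are all illustrative assumptions.

```python
# Minimal sketch of InferCode-style self-supervised pretraining (assumes PyTorch).
# The encoder is a simplified stand-in for the paper's Tree-Based CNN; subtree ids
# are assumed to have been extracted from the snippet's AST beforehand.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InferCodeSketch(nn.Module):
    def __init__(self, num_node_types: int, num_subtrees: int, dim: int = 128):
        super().__init__()
        self.node_embed = nn.Embedding(num_node_types, dim)  # AST node-type embeddings
        self.conv = nn.Linear(dim, dim)                       # stand-in for tree convolution
        self.subtree_embed = nn.Embedding(num_subtrees, dim)  # subtree "label" embeddings

    def encode(self, node_types: torch.Tensor) -> torch.Tensor:
        # node_types: (num_nodes,) ids of AST node types for one code snippet.
        h = torch.relu(self.conv(self.node_embed(node_types)))
        return h.mean(dim=0)                                  # pooled code vector

    def loss(self, node_types: torch.Tensor, subtree_ids: torch.Tensor) -> torch.Tensor:
        # Predict each subtree identified in this snippet's AST from the code vector,
        # with a softmax over the whole subtree vocabulary (cross-entropy per target).
        v = self.encode(node_types)                           # (dim,)
        logits = self.subtree_embed.weight @ v                # (num_subtrees,)
        return F.cross_entropy(logits.expand(len(subtree_ids), -1), subtree_ids)

# Toy usage: one snippet with 5 AST nodes and 3 subtrees observed in its AST.
model = InferCodeSketch(num_node_types=100, num_subtrees=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
nodes = torch.randint(0, 100, (5,))
subtrees = torch.randint(0, 1000, (3,))
opt.zero_grad()
loss = model.loss(nodes, subtrees)
loss.backward()
opt.step()
```

After pretraining on unlabeled code in this way, the pooled code vector can be used directly for unsupervised tasks (clustering, clone detection, cross-language search) or the encoder weights can be fine-tuned for supervised tasks, as described in the abstract above.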