Paper Title
SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations
Paper Authors
Paper Abstract
Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existing pre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning.
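To make the abstract's description concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of sequence-to-sequence denoising pre-training over a concatenated input of code tokens, a linearized code-structure sequence, and a natural-language-like sequence. The vocabulary size, the special-token ids, the single token-masking objective, and the names `TinySeq2Seq`, `build_input`, and `mask_spans` are all assumptions standing in for SPT-Code's three specially designed pre-training tasks.

```python
# Minimal sketch, assuming a joint vocabulary over code, structure, and NL tokens.
# This is NOT SPT-Code itself; it only illustrates seq2seq denoising pre-training
# in which both the Transformer encoder and decoder receive training signal.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000          # hypothetical joint vocabulary size
PAD, MASK, SEP = 0, 1, 2   # hypothetical special-token ids
D_MODEL = 256

class TinySeq2Seq(nn.Module):
    """Toy Transformer encoder-decoder with a shared input embedding."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)

def build_input(code_ids, structure_ids, nl_ids):
    """Concatenate the three information sources with separator tokens."""
    return code_ids + [SEP] + structure_ids + [SEP] + nl_ids

def mask_spans(ids, mask_prob=0.15):
    """Corrupt the input by masking tokens; the decoder reconstructs the original."""
    return [MASK if torch.rand(1).item() < mask_prob and t > SEP else t for t in ids]

# Toy data: pretend these are sub-tokenized code, a linearized AST, and NL tokens.
code_ids, structure_ids, nl_ids = [10, 11, 12, 13], [20, 21, 22], [30, 31]
full = build_input(code_ids, structure_ids, nl_ids)
src = torch.tensor([mask_spans(full)])   # corrupted encoder input
tgt = torch.tensor([full])               # original sequence as decoder target

model = TinySeq2Seq()
logits = model(src, tgt[:, :-1])         # teacher forcing
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), tgt[:, 1:].reshape(-1), ignore_index=PAD
)
loss.backward()
print(f"denoising loss: {loss.item():.3f}")
```

The point of the sketch is the training setup rather than the specific objective: because the target sequence is produced autoregressively, the decoder is updated during pre-training alongside the encoder, which is the property the abstract argues is missing from encoder-only pre-trained models.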