Paper Title
CodePAD: Sequence-based Code Generation with Pushdown Automaton
Paper Authors
Paper Abstract
In the process of code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the programming language (PL). However, neglecting grammar constraints is a fatal drawback of commonly used sequence-based code generation. In this paper, we devise a pushdown automaton (PDA)-based methodology to address this problem, exploiting the principle that a PL is a subset of the language recognizable by a PDA and that code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm to constrain the generation of sequence-based models so as to ensure grammatical correctness. Guided by this methodology, we further propose CodePAD, a sequence-based code generation framework equipped with a PDA module, to integrate the deduction of the PDA into deep learning. Additionally, this framework can leverage the states of PDA deduction (including state representation, a state prediction task, and joint prediction with states) to assist models in learning PDA deduction. To comprehensively evaluate CodePAD, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CodePAD can leverage existing sequence-based models, and we show that it achieves a 100\% grammatical correctness percentage on these benchmark datasets. It thereby yields relative improvements of 17\% CodeBLEU on CONALA, 8\% EM on DJANGO, and 15\% CodeBLEU on JUICE-10K over the base models. In addition, our method significantly enhances pre-trained models, e.g., improving the CodeBLEU of CodeGen-350M from 3.21 to 21.54 on MBPP in the zero-shot setting.
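To make the idea of PDA-constrained generation concrete, below is a minimal, self-contained sketch, not the authors' CodePAD module or their Python grammar PDA: a toy pushdown automaton over a bracket-only vocabulary masks out next tokens that would violate its transitions, so any decoded sequence that terminates with `<eos>` is accepted by the PDA. The vocabulary, the scoring function, and the PDA itself are illustrative assumptions.

```python
# A minimal sketch of PDA-constrained decoding (illustrative only; not the
# paper's implementation). A toy PDA for balanced parentheses masks the
# next-token scores so that only grammatical continuations can be chosen.

import math
import random

VOCAB = ["(", ")", "x", "<eos>"]  # toy vocabulary, an assumption for illustration


def allowed_tokens(stack):
    """Return the subset of VOCAB the toy PDA allows next, given its stack."""
    allowed = {"(", "x"}          # opening a group or emitting a terminal is always legal
    if stack:                     # ')' is legal only if an unmatched '(' is on the stack
        allowed.add(")")
    else:                         # '<eos>' is legal only once all brackets are matched
        allowed.add("<eos>")
    return allowed


def constrained_decode(score_fn, max_len=20):
    """Greedy decoding where tokens rejected by the PDA get a -inf score (mask)."""
    stack, output = [], []
    for _ in range(max_len):
        legal = allowed_tokens(stack)
        scores = {t: (score_fn(output, t) if t in legal else -math.inf) for t in VOCAB}
        token = max(scores, key=scores.get)
        if token == "<eos>":
            break
        output.append(token)
        if token == "(":
            stack.append("(")     # push on an opening bracket
        elif token == ")":
            stack.pop()           # pop on a matching closing bracket
    return "".join(output)


if __name__ == "__main__":
    random.seed(0)
    # Stand-in for a sequence model's next-token score (random here).
    print(constrained_decode(lambda prefix, tok: random.random()))
```

In CodePAD the same masking role is played by a PDA built for the full Python grammar and combined with learned state representations; the sketch above only shows how a stack-based acceptor can restrict a sequence model's output space at each decoding step.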