DocPrompting: Generating Code by Retrieving the Docs

Paper Title

DocPrompting: Generating Code by Retrieving the Docs

Authors

Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig

Abstract

Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in the training data. In contrast, when human programmers use functions and libraries for the first time, they frequently refer to textual resources such as code manuals and documentation, to explore and understand the available functionality. Inspired by this observation, we introduce DocPrompting: a natural-language-to-code generation approach that explicitly leverages documentation by (1) retrieving the relevant documentation pieces given an NL intent, and (2) generating code based on the NL intent and the retrieved documentation. DocPrompting is general: it can be applied to any programming language and is agnostic to the underlying neural model. We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo1.3B by up to absolute 6.9% exact match.
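
The two steps named in the abstract, retrieving documentation for an NL intent and then conditioning generation on the intent plus the retrieved docs, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the token-overlap scorer is a toy stand-in and not the retriever used in the paper, `DOC_POOL` is a hand-written list rather than the tldr or CoNaLa documentation pool, and `retrieve` and `build_prompt` are hypothetical helper names. The sketch stops at building the prompt; the call to a code generator such as CodeT5 is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def retrieve(intent, docs, k=2):
    """Return the k docs with the highest token overlap with the intent.

    Toy scorer for illustration only; a real system would use a sparse
    or dense retriever instead.
    """
    query = Counter(tokenize(intent))

    def score(doc):
        doc_tokens = Counter(tokenize(doc))
        return sum(min(query[t], doc_tokens[t]) for t in query)

    # sorted() is stable, so ties keep their original pool order.
    return sorted(docs, key=score, reverse=True)[:k]


def build_prompt(intent, retrieved):
    """Concatenate retrieved docs and the intent into one generator input."""
    doc_block = "\n".join(f"# doc: {d}" for d in retrieved)
    return f"{doc_block}\n# intent: {intent}\n"


# Hypothetical documentation pool (assumption, not from the paper).
DOC_POOL = [
    "tar -x: extract files from an archive",
    "tar -f FILE: use archive file FILE",
    "ls -l: use a long listing format",
]

intent = "extract files from the archive backup.tar"
top_docs = retrieve(intent, DOC_POOL, k=2)
prompt = build_prompt(intent, top_docs)
print(prompt)
```

The resulting `prompt` string is what would be passed to the underlying code model; because the generator only sees text, any model that accepts a text input could be swapped in, which matches the abstract's claim that the approach is agnostic to the underlying neural model.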

Paper Title