Paper Title

ReCode: Robustness Evaluation of Code Generation Models

Paper Authors

Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang

Paper Abstract

Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model's robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
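To make the abstract's two key ideas concrete, below is a minimal, hypothetical sketch: a semantics-preserving prompt perturbation (here, variable renaming via Python's ast module) and a worst-case robustness metric that counts a task as solved only if generations pass the unit tests under every perturbed variant. The function names, example prompt, and results layout are illustrative assumptions, not the paper's actual implementation.

```python
import ast

# Illustrative sketch (not the ReCode implementation): a semantics-preserving
# perturbation that renames variables in a prompt's code, plus a worst-case
# robustness metric aggregated over perturbed variants.

def rename_variables(source: str, mapping: dict) -> str:
    """Rename variables/arguments in Python source; semantics are preserved."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
        elif isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


def worst_case_pass_rate(results: dict) -> float:
    """results: {task_id: {variant_name: passed_unit_tests_bool}}.

    A task counts as robustly solved only if the generation passes under
    every perturbed variant of its prompt (worst-case behavior)."""
    return sum(all(variants.values()) for variants in results.values()) / len(results)


if __name__ == "__main__":
    prompt = (
        "def sum_list(values):\n"
        "    total = 0\n"
        "    for v in values:\n"
        "        total += v\n"
        "    return total\n"
    )
    # One perturbed variant: identical behavior, different identifiers.
    print(rename_variables(prompt, {"values": "items", "total": "acc", "v": "x"}))

    # Hypothetical evaluation results: two tasks, two variants each.
    results = {
        "task_0": {"rename": True, "docstring_typo": True},   # robustly solved
        "task_1": {"rename": True, "docstring_typo": False},  # fails worst case
    }
    print(worst_case_pass_rate(results))  # 0.5
```

A renamed-variable prompt like the one above should elicit the same completion behavior from a robust model; aggregating pass/fail over all such variants with the worst-case rule is what makes the metric strictly harder than the standard per-prompt pass rate.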
