CCTEST: Testing and Repairing Code Completion Systems

Paper Title

CCTEST: Testing and Repairing Code Completion Systems

Authors

Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, Cuiyun Gao

Abstract

Code completion, a highly valuable topic in the software development domain, has increasingly been driven forward by recent advances in large language models (LLMs). To date, prominent LLM-based code completion frameworks such as GitHub Copilot and GPT are trained using deep learning over vast quantities of unstructured text and open-source code. As a cornerstone of daily programming tasks, code completion has greatly boosted professionals' efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion systems often output suspicious results, and to date no automated testing and enhancement framework for code completion systems is available. This research proposes CCTEST, a framework to test and repair code completion systems in black-box settings. CCTEST features a set of novel mutation strategies, namely program structure-correlated (PSC) mutations, to generate mutated code completion inputs. It then detects inconsistent outputs, which represent possibly erroneous cases, among all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting the output that best reflects the "average" appearance of all output cases and using it as the final output of the code completion system. We detected a total of 33,540 inputs (with a true positive rate of 86%) that can trigger erroneous cases in eight popular LLM-based code completion systems. With repair, we show that the accuracy of code completion systems increases notably, by 40% and 67% with respect to BLEU score and Levenshtein edit similarity, respectively.
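
The repair step described in the abstract, picking the output that best reflects the "average" of all completions, can be illustrated with a small sketch. The abstract does not say which similarity measure drives the selection, so this sketch assumes Levenshtein edit similarity (which the abstract names only as an evaluation metric). The function names and the 0.5 outlier threshold are hypothetical and are not taken from the paper.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def mean_similarity(outputs: List[str], index: int) -> float:
    """Average similarity of one completion to all the other completions."""
    others = [o for k, o in enumerate(outputs) if k != index]
    if not others:
        return 1.0
    return sum(edit_similarity(outputs[index], o) for o in others) / len(others)


def select_average_output(outputs: List[str]) -> str:
    """Return the completion closest, on average, to every other completion."""
    if not outputs:
        raise ValueError("need at least one completion")
    scores = [mean_similarity(outputs, i) for i in range(len(outputs))]
    best = max(range(len(outputs)), key=scores.__getitem__)
    return outputs[best]


def find_outliers(outputs: List[str], threshold: float = 0.5) -> List[int]:
    """Indices of completions that disagree with the rest.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [i for i in range(len(outputs))
            if mean_similarity(outputs, i) < threshold]


# Completions produced for one prompt and its mutated variants (made-up data).
completions = [
    "return a + b",
    "return a + b",
    "return b + a",
    "print('hello world')",
]
repaired = select_average_output(completions)
suspicious = find_outliers(completions)
```

Here `repaired` is the completion that agrees most with the others, and `suspicious` lists the completions whose average similarity falls below the assumed threshold, loosely mirroring the inconsistency detection the abstract describes.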

