Paper title
TSSB-3M: Mining single statement bugs at massive scale
Paper authors
Paper abstract
Single statement bugs are one of the most important ingredients in the evaluation of modern bug detection and automatic program repair methods. Because they affect only a single statement, they represent a type of bug often overlooked by developers, while still being small enough to be detected and fixed by automatic methods. With the rise of data-driven automatic repair, the availability of single statement bugs at the scale of millions of examples is more important than ever, not only for testing these methods but also for providing sufficient real-world examples for training. To provide access to bug fix datasets of this scale, we are releasing two datasets called SSB-9M and TSSB-3M. While SSB-9M provides access to a collection of over 9M general single statement bug fixes from over 500K open source Python projects, TSSB-3M focuses on over 3M single statement bugs which can be fixed solely by a single statement change. To facilitate future research and empirical investigations, we annotated each bug fix with one of 20 single statement bug (SStuB) patterns typical for Python together with a characterization of the code change as a sequence of AST modifications. Our initial investigation shows that at least 40% of all single statement bug fixes mined fit at least one SStuB pattern, and that the majority (72%) of all bugs can be fixed with the same syntactic modifications as needed for fixing SStuBs.
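To make the notion of a single statement bug fix concrete, the following minimal sketch compares a buggy and a fixed version of a function at the AST level and reports the one statement that changed. It is an illustration only and not the mining procedure used to build SSB-9M or TSSB-3M: the helper names `simple_statements` and `single_statement_diff` and the `BUGGY`/`FIXED` snippets are hypothetical, and the sketch assumes that no statements were added or removed by the fix. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


def simple_statements(source):
    """Collect statement nodes that have no `body` field
    (assignments, returns, expression statements, ...)."""
    tree = ast.parse(source)
    return [
        node
        for node in ast.walk(tree)
        if isinstance(node, ast.stmt) and "body" not in node._fields
    ]


def single_statement_diff(before, after):
    """Return (old, new) source text if exactly one statement differs,
    otherwise None. Assumption: both versions have the same number of
    statements; insertions and deletions are not handled in this sketch."""
    old, new = simple_statements(before), simple_statements(after)
    if len(old) != len(new):
        return None
    changed = [
        (ast.unparse(a), ast.unparse(b))
        for a, b in zip(old, new)
        # ast.dump ignores line/column info, so only structure is compared
        if ast.dump(a) != ast.dump(b)
    ]
    return changed[0] if len(changed) == 1 else None


# Hypothetical off-by-one bug and its single-statement fix.
BUGGY = """
def last(items):
    idx = len(items)
    return items[idx]
"""

FIXED = """
def last(items):
    idx = len(items) - 1
    return items[idx]
"""

print(single_statement_diff(BUGGY, FIXED))
# → ('idx = len(items)', 'idx = len(items) - 1')
```

Comparing `ast.dump` output rather than raw text means that whitespace or comment-only edits are not counted as changes, which is one plausible reason to characterize fixes as AST modifications rather than as textual diffs.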