Paper Title

AEON: A Method for Automatic Evaluation of NLP Test Cases

Paper Authors

Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, Michael R. Lyu

Paper Abstract

Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus, the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., grammar errors), which leads to a high false alarm rate and unnatural test cases. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade NLP software when utilized in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns the best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. In addition, AEON also has the highest average precision of finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON's potential in improving NLP software.
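To make the two scoring criteria concrete, below is a minimal illustrative sketch of the general idea of scoring a mutated test case on semantic similarity and language naturalness. It is not AEON's actual implementation; the choice of a sentence-transformers encoder (all-MiniLM-L6-v2) for similarity and GPT-2 perplexity for naturalness, as well as the example sentences, are assumptions made only for illustration.

```python
# Illustrative sketch only (NOT AEON's implementation): score a mutated test
# case by (1) semantic similarity to its seed sentence and (2) naturalness.
# Model choices below are assumptions for demonstration purposes.
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

# Assumed models; any sentence encoder / causal LM would work for this sketch.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def semantic_similarity(seed: str, mutant: str) -> float:
    """Cosine similarity between sentence embeddings (higher = more similar)."""
    emb = encoder.encode([seed, mutant], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def naturalness(sentence: str) -> float:
    """Inverse perplexity under a causal LM (higher = more natural)."""
    ids = lm_tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return 1.0 / torch.exp(loss).item()


# Hypothetical seed/mutant pair, e.g., produced by a synonym-replacement mutator.
seed = "The movie was surprisingly good."
mutant = "The film was unexpectedly good."
print(semantic_similarity(seed, mutant), naturalness(mutant))
```

A testing pipeline could then rank or filter generated test cases by such scores, keeping only mutants that are both semantically close to their seeds and sufficiently natural, which is the kind of prioritization the abstract describes.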
