Paper Title
ARCLIN: Automated API Mention Resolution for Unformatted Texts
Paper Authors
Paper Abstract
Online technical forums (e.g., StackOverflow) are popular platforms for developers to discuss technical problems such as how to use a specific Application Programming Interface (API), how to solve a programming task, or how to fix bugs in their code. These discussions can often provide auxiliary knowledge on how to use software that is not covered by the official documentation. Automatically extracting such knowledge would support a set of downstream tasks such as API search or indexing. However, unlike official documentation written by experts, discussions in open forums are made by regular developers who write in short, informal texts that include spelling errors and abbreviations. There are three major challenges in accurately recognizing APIs and linking mentioned APIs from unstructured natural-language documents to entries in an API repository: (1) distinguishing API mentions from common words; (2) identifying API mentions that lack a fully qualified name; and (3) disambiguating API mentions that have similar method names but belong to different libraries. In this paper, to tackle these challenges, we propose ARCLIN, a tool that can effectively distinguish and link APIs without using human annotations. Specifically, we first design an API recognizer that automatically extracts API mentions from natural-language sentences using a Conditional Random Field (CRF) on top of a Bi-directional Long Short-Term Memory (Bi-LSTM) module; we then apply a context-aware scoring mechanism to compute the mention-entry similarity for each entry in the API repository. Compared to previous approaches that rely on heuristic rules, our proposed tool, which requires no manual inspection, outperforms them by 8% on Py-mention, a high-quality dataset containing 558 mentions and 2,830 sentences from five popular Python libraries.
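To make the linking step concrete, here is a rough, hypothetical sketch of the context-aware scoring idea described above: the words surrounding a mention are compared against the description of each candidate entry in the API repository, and the best-scoring entry wins. This toy version uses a bag-of-words cosine similarity rather than the learned neural scorer ARCLIN actually employs, and all candidate names and descriptions below are illustrative, not from the paper.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def link_mention(context: str, candidates: dict) -> str:
    """Pick the API-repository entry whose description best matches
    the words around the mention (toy stand-in for a learned scorer)."""
    ctx = Counter(context.lower().split())
    return max(
        candidates,
        key=lambda entry: cosine(ctx, Counter(candidates[entry].lower().split())),
    )


# Hypothetical repository entries for the ambiguous mention "append",
# which exists both in pandas and on Python's built-in list type:
candidates = {
    "pandas.DataFrame.append": "append rows of other dataframe to this dataframe",
    "list.append": "append object to the end of the list",
}

print(link_mention("add a row to my dataframe with append", candidates))
# → pandas.DataFrame.append
```

The same mechanism resolves challenge (3) from the abstract: identical method names in different libraries receive different scores because their documentation contexts differ.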