Paper Title
A Tool to Extract Structured Data from GitHub
Authors
Abstract
GitHub repositories contain detailed information about a project: its contributors, the number of commits and their authors, releases, pull requests, programming languages, and issues. However, no systematic dataset of open-source projects exists that provides detailed repository information from GitHub for knowledge acquisition and mining. In this paper, we present a tool, named GitRepository, that helps create a dataset of repositories based on a proposed schema. Out of an initial 1,680 repositories, the dataset hosts 620 repositories after applying basic filters on stars and forks, and 247 repositories after applying all predefined filters. The tool extracts information about GitHub repositories and saves the data as CSV files and a database (.db) file.
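The filter-and-export step described in the abstract can be illustrated with a minimal sketch. It is not the GitRepository implementation: the function names, the table name `repositories`, the four-column schema, and the threshold values are all assumptions made for illustration, since the abstract does not state the proposed schema or the filter values. The field names (`full_name`, `stargazers_count`, `forks_count`, `language`) follow the repository objects returned by the GitHub REST API, and the input is assumed to be a list of such objects already fetched.

```python
import csv
import sqlite3

# Hypothetical subset of columns; the paper's actual schema is not given here.
FIELDS = ["full_name", "stargazers_count", "forks_count", "language"]


def filter_repos(repos, min_stars, min_forks):
    """Keep repositories that meet both the star and the fork threshold."""
    return [
        r for r in repos
        if r["stargazers_count"] >= min_stars and r["forks_count"] >= min_forks
    ]


def save_csv(repos, path):
    """Write the selected fields of each repository to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops any API fields not listed in FIELDS.
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(repos)


def save_sqlite(repos, path):
    """Write the same fields to a SQLite database (.db) file."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS repositories ("
            "full_name TEXT PRIMARY KEY, stargazers_count INTEGER, "
            "forks_count INTEGER, language TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO repositories VALUES (?, ?, ?, ?)",
            [tuple(r.get(f) for f in FIELDS) for r in repos],
        )
        conn.commit()
    finally:
        conn.close()
```

A call such as `save_csv(filter_repos(repos, 100, 10), "repositories.csv")` would then export only the repositories with at least 100 stars and 10 forks; both numbers are placeholders, not the filters used in the paper.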