Paper Title
LiDetector: License Incompatibility Detection for Open Source Software
Paper Authors
Paper Abstract
Open-source software (OSS) licenses dictate the conditions under which software may be reused, distributed, and modified. Apart from widely used licenses such as the MIT License, developers may also write their own licenses (called custom licenses), whose wording is more flexible. This variety makes it challenging to understand licenses and their compatibility. To avoid financial and legal risks, it is essential to ensure license compatibility when integrating third-party packages or reusing licensed code. In this work, we propose LiDetector, an effective tool that extracts and interprets OSS licenses (both official and custom licenses) and detects incompatibility among them. Specifically, LiDetector introduces a learning-based method that automatically identifies meaningful license terms in an arbitrary license, and employs a Probabilistic Context-Free Grammar (PCFG) to infer the rights and obligations used for incompatibility detection. Experiments demonstrate that LiDetector outperforms existing methods, with 93.28% precision for term identification and 91.09% accuracy for right and obligation inference, and that it detects incompatibility effectively, with a 10.06% FP rate and a 2.56% FN rate. Furthermore, our large-scale empirical study of 1,846 projects with LiDetector reveals that 72.91% of the projects suffer from license incompatibility, including projects that use popular licenses such as the MIT License and the Apache License. We highlight lessons learned from the perspectives of different stakeholders and make all related data and the replication package publicly available to facilitate follow-up research.
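To make the final detection step more concrete, the following is a minimal sketch of how incompatibility could be checked once rights and obligations have been inferred. It is an illustration under stated assumptions, not LiDetector's actual implementation: the three attitude labels, the term names, the `find_conflicts` helper, and the conflict rule itself are all hypothetical and are not taken from the abstract.

```python
from enum import Enum


class Attitude(Enum):
    """Hypothetical attitude a license takes toward a term (assumed labels)."""
    CAN = "can"        # the action is permitted (a right)
    CANNOT = "cannot"  # the action is prohibited
    MUST = "must"      # the action is required (an obligation)


def find_conflicts(project, component):
    """Return the sorted terms on which a project license conflicts with a
    component license, under a simplified, assumed rule:
      - the component forbids what the project permits or requires, or
      - the component requires what the project forbids.
    Terms on which the project license is silent are skipped.
    """
    conflicts = []
    for term, comp_att in component.items():
        proj_att = project.get(term)
        if proj_att is None:
            continue  # project license says nothing about this term
        forbidden_but_allowed = (
            comp_att is Attitude.CANNOT
            and proj_att in (Attitude.CAN, Attitude.MUST)
        )
        required_but_forbidden = (
            comp_att is Attitude.MUST and proj_att is Attitude.CANNOT
        )
        if forbidden_but_allowed or required_but_forbidden:
            conflicts.append(term)
    return sorted(conflicts)


# Toy inputs: in a real pipeline these mappings would come from term
# identification and right/obligation inference, not be written by hand.
project_license = {
    "commercial_use": Attitude.CAN,
    "disclose_source": Attitude.CANNOT,
    "modify": Attitude.CAN,
}
component_license = {
    "commercial_use": Attitude.CANNOT,
    "disclose_source": Attitude.MUST,
    "modify": Attitude.CAN,
}

print(find_conflicts(project_license, component_license))
# prints ['commercial_use', 'disclose_source']
```

The rule is deliberately directional: a project that is stricter than its component (for example, forbidding something the component merely permits) is not flagged, since only the component's prohibitions and obligations constrain the project in this sketch.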