Paper Title
OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Logs [Technical Report]
Paper Authors
Abstract
Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query and when). As such, it is instrumental for a wide range of critical governance applications (e.g., observability and auditing). Unfortunately, in the context of database systems, extracting coarse-grained provenance is a long-standing problem due to the complexity and sheer volume of database workflows. Provenance extraction from query event logs has recently been proposed as a favorable approach because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial overhead to the database and provenance extraction workflows and (b) extract provenance that is noisy, omits query execution dependencies, and is not rich enough for upstream applications. To address these problems, we introduce OneProvenance: an efficient provenance extraction system from query event logs. OneProvenance addresses the unique challenges of log-based extraction by (a) identifying query execution dependencies through efficient log analysis, (b) extracting provenance through novel event transformations that account for query dependencies, and (c) introducing effective filtering optimizations. Our thorough experimental analysis shows that OneProvenance can improve extraction by up to ~18X compared to state-of-the-art baselines; our optimizations reduce the extraction noise and optimize performance even further. OneProvenance is deployed at scale by Microsoft Purview and actively supports customer provenance extraction needs (https://bit.ly/3N2JVGF).