Paper Title
Building Containerized Environments for Reproducibility and Traceability of Scientific Workflows
Paper Authors
Abstract
Scientists rely on simulations to study natural phenomena. Trust in simulation results is vital to advancing science in any field. One approach to building trust is to ensure the reproducibility and traceability of simulations by annotating executions at the system level and by generating record trails of the data moving through the simulation workflow. In this work, we present a system-level solution that leverages the intrinsic characteristics of containers (i.e., portability, isolation, encapsulation, and unique identifiers). Our solution consists of a containerized environment capable of annotating workflows, capturing provenance metadata, and building record trails. We assess our environment on four different workflows and measure containerization costs in terms of time and space. Our solution, built with a tolerable time and space overhead, enables transparent and automatic provenance metadata collection and access, an easy-to-read record trail, and tight connections between data and metadata.