Paper Title
Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE)
Paper Authors
Paper Abstract
Artificial Intelligence (AI) development is inherently iterative and experimental. Over the course of normal development, especially with the advent of automated AI, hundreds or thousands of experiments are generated and are often lost or never examined again. There is a lost opportunity to document these experiments and learn from them at scale, but the complexity of tracking and reproducing these experiments is often prohibitive to data scientists. We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts and provides features to reproduce these artifacts and perform meta-learning across them. We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment. The standardized nature of the stored metadata allows for querying and aggregation, especially in terms of ranking artifacts by performance metrics. We exhibit the capabilities of the LDE by reproducing an existing meta-learning study and storing the reproduced metadata in our system. Then, we perform two experiments on this metadata: 1) examining the reproducibility and variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data and examining how variability in experimental results impacts recommendation performance. The experimental results suggest significant variation in performance, especially depending on dataset configurations; this variation carries over when meta-learning is built on top of the results, with performance improving when using aggregated results. This suggests that a system that automatically collects and aggregates results such as the LDE not only assists in implementing meta-learning but may also improve its performance.
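The abstract's core aggregation idea, that repeated runs of the same experiment are collected and summarized before artifacts are ranked by a performance metric, can be sketched minimally as follows. This is an illustrative example only, assuming a flat run-record structure; the record and field names (`dataset`, `pipeline`, `accuracy`, `rank_pipelines`) are hypothetical and not the LDE's actual schema or API.

```python
# Hypothetical sketch: aggregate repeated runs of each (dataset, pipeline)
# pair, then rank pipelines on a dataset by their mean metric value.
from collections import defaultdict
from statistics import mean

# Illustrative run metadata; in the LDE this would come from the database.
runs = [
    {"dataset": "d1", "pipeline": "p1", "accuracy": 0.81},
    {"dataset": "d1", "pipeline": "p1", "accuracy": 0.79},
    {"dataset": "d1", "pipeline": "p2", "accuracy": 0.75},
]

def rank_pipelines(runs, dataset):
    """Aggregate repeated runs per pipeline, then rank by mean accuracy."""
    scores = defaultdict(list)
    for r in runs:
        if r["dataset"] == dataset:
            scores[r["pipeline"]].append(r["accuracy"])
    aggregated = {p: mean(vals) for p, vals in scores.items()}
    return sorted(aggregated, key=aggregated.get, reverse=True)

print(rank_pipelines(runs, "d1"))  # ['p1', 'p2']: mean(0.81, 0.79) > 0.75
```

Ranking over aggregated (rather than single-run) metrics is the point the abstract's experiments speak to: variance across repeated runs otherwise carries over into any meta-learner built on top of the results.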