Paper title
Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems
Paper authors
Paper abstract
Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.
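As a rough illustration of the idea of applying unsupervised learning to execution traces, the following minimal sketch groups fault injection experiments by how their traces deviate from a fault-free run. It is not the method from the paper: the event names, the toy traces, the count-based features, the distance threshold, and the greedy "leader" clustering are all assumptions chosen to keep the example short and dependency-free.

```python
from collections import Counter

# Hypothetical event types that might appear in an execution trace.
VOCABULARY = ["api_call", "db_write", "rpc_error", "timeout"]


def trace_vector(trace):
    """Count how often each event type occurs in one execution trace."""
    counts = Counter(trace)
    return [counts[event] for event in VOCABULARY]


def deviation(vector, baseline):
    """Per-event difference between a faulty run and the fault-free run."""
    return [v - b for v, b in zip(vector, baseline)]


def distance(a, b):
    """Euclidean distance between two deviation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def leader_clustering(vectors, threshold):
    """Greedy clustering: join the first cluster whose leader is within
    `threshold`, otherwise start a new cluster with this vector as leader."""
    leaders, labels = [], []
    for vec in vectors:
        for idx, leader in enumerate(leaders):
            if distance(vec, leader) <= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels


# Toy data (assumed, not taken from the paper's experiments).
baseline = trace_vector(["api_call", "db_write", "api_call", "db_write"])
experiments = [
    ["api_call", "db_write", "api_call", "db_write"],            # no visible effect
    ["api_call", "rpc_error", "rpc_error"],                      # RPC failures
    ["api_call", "rpc_error", "rpc_error", "rpc_error"],         # RPC failures
    ["api_call", "db_write", "api_call", "timeout", "timeout"],  # timeouts
]

deviations = [deviation(trace_vector(t), baseline) for t in experiments]
labels = leader_clustering(deviations, threshold=1.5)
print(labels)  # → [0, 1, 1, 2]
```

Each distinct label can be read as one candidate failure mode: the two RPC-error experiments fall into the same group, while the timeout experiment and the unaffected run form groups of their own. An analyst would then inspect one representative per group instead of every experiment.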