Paper Title
Towards a consistent interpretation of AIOps models
Paper Authors
Abstract
Artificial Intelligence for IT Operations (AIOps) has been adopted by organizations for various tasks, including interpreting models to identify indicators of service failures. To avoid misleading practitioners, AIOps model interpretations should be consistent (i.e., different AIOps models trained on the same task should agree with one another on feature importance). However, many AIOps studies violate established practices in the machine learning community when deriving interpretations, such as interpreting models with suboptimal performance, and the impact of such violations on interpretation consistency has not been studied. In this paper, we investigate the consistency of AIOps model interpretations along three dimensions: internal consistency, external consistency, and time consistency. We conduct a case study on two AIOps tasks: predicting Google cluster job failures and predicting Backblaze hard drive failures. We find that the randomness from learners, hyperparameter tuning, and data sampling should be controlled to generate consistent interpretations. AIOps models with AUCs greater than 0.75 yield more consistent interpretations than low-performing models. Finally, AIOps models constructed with the Sliding Window or Full History approaches have interpretations that are most consistent with the trends present in the entire datasets. Our study provides valuable guidelines for practitioners to derive consistent AIOps model interpretations.
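To make the notion of interpretation consistency concrete, below is a minimal illustrative sketch (not the paper's exact methodology) of how agreement on feature importance between two models on the same task could be quantified, using top-k feature overlap and Kendall's rank correlation. The feature names and importance scores are hypothetical.

```python
# Illustrative sketch: quantify how much two models on the same task
# agree on feature importance. This is an assumed measure, not the
# paper's exact procedure.

def top_k_overlap(importance_a, importance_b, k=3):
    """Fraction of overlap between the top-k features of two models.

    importance_a / importance_b: dicts mapping feature name -> importance score.
    """
    top_a = set(sorted(importance_a, key=importance_a.get, reverse=True)[:k])
    top_b = set(sorted(importance_b, key=importance_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k

def kendall_tau(importance_a, importance_b):
    """Kendall rank correlation over the features shared by both models."""
    feats = sorted(set(importance_a) & set(importance_b))
    concordant = discordant = 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            da = importance_a[feats[i]] - importance_a[feats[j]]
            db = importance_b[feats[i]] - importance_b[feats[j]]
            if da * db > 0:       # both models rank the pair the same way
                concordant += 1
            elif da * db < 0:     # the models disagree on this pair
                discordant += 1
    pairs = len(feats) * (len(feats) - 1) / 2
    return (concordant - discordant) / pairs if pairs else 0.0

# Hypothetical importance scores from two models trained on the same task.
model_a = {"cpu_request": 0.40, "mem_request": 0.30, "priority": 0.20, "disk": 0.10}
model_b = {"cpu_request": 0.35, "mem_request": 0.33, "priority": 0.22, "disk": 0.10}

print(top_k_overlap(model_a, model_b))  # 1.0: identical top-3 features
print(kendall_tau(model_a, model_b))    # 1.0: identical feature ordering
```

Scores near 1.0 indicate consistent interpretations; lower top-k overlap or rank correlation would signal the kind of disagreement the paper warns can mislead practitioners.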