Paper Title
Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching
Authors
Abstract
To ensure the performance of online service systems, their status is closely monitored with various software and system metrics. Performance anomalies represent the performance degradation issues (e.g., slow response) of the service systems. When performing anomaly detection over the metrics, existing methods often lack the merit of interpretability, which is vital for engineers and analysts to take remediation actions. Moreover, they are unable to effectively accommodate the ever-changing services in an online fashion. To address these limitations, in this paper, we propose ADSketch, an interpretable and adaptive performance anomaly detection approach based on pattern sketching. ADSketch achieves interpretability by identifying groups of anomalous metric patterns, which represent particular types of performance issues. The underlying issues can then be immediately recognized if similar patterns emerge again. In addition, an adaptive learning algorithm is designed to embrace unprecedented patterns induced by service updates or user behavior changes. The proposed approach is evaluated with public data as well as industrial data collected from a representative online service system in Huawei Cloud. The experimental results show that ADSketch outperforms state-of-the-art approaches by a significant margin, and demonstrate the effectiveness of the online algorithm in new pattern discovery. Furthermore, our approach has been successfully deployed in industrial practice.
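The general idea of matching metric windows against stored patterns can be illustrated with a small sketch. This is not the ADSketch algorithm, whose details the abstract does not give; it is a simplified nearest-pattern matcher written under stated assumptions. The class `PatternSketch`, the helpers `windows` and `distance`, the Euclidean distance, the fixed `threshold`, and the label "slow response" attached to a pattern are all hypothetical choices made for illustration only.

```python
import math


def windows(series, size):
    """Split a metric series into non-overlapping windows of length `size`."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def distance(a, b):
    """Euclidean distance between two equal-length windows (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PatternSketch:
    """Hypothetical store of normal and labeled anomalous metric patterns."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum distance that counts as a match
        self.normal = []            # patterns seen in anomaly-free data
        self.anomalous = []         # (pattern, issue label) pairs

    def fit_normal(self, series, size):
        """Keep one representative for each distinct normal pattern."""
        for w in windows(series, size):
            if all(distance(w, p) > self.threshold for p in self.normal):
                self.normal.append(w)

    def label_anomaly(self, pattern, label):
        """Record an anomalous pattern with the issue it represents."""
        self.anomalous.append((pattern, label))

    def classify(self, window):
        """Return (kind, label) of the nearest stored pattern, or unknown."""
        candidates = [(distance(window, p), "normal", None) for p in self.normal]
        candidates += [
            (distance(window, p), "anomalous", label)
            for p, label in self.anomalous
        ]
        if not candidates:
            return ("unknown", None)
        best_d, kind, label = min(candidates, key=lambda c: c[0])
        if best_d > self.threshold:
            # Nothing stored is close enough: an unprecedented pattern.
            return ("unknown", None)
        return (kind, label)


sketch = PatternSketch(threshold=1.0)
sketch.fit_normal([1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0], size=4)

spike = [5.0, 5.2, 4.9, 5.1]
first_seen = sketch.classify(spike)           # not yet known to the sketch
sketch.label_anomaly(spike, "slow response")  # an engineer names the issue
recurrence = sketch.classify([5.1, 5.0, 5.0, 5.2])
```

In this toy setup a window that matches no stored pattern comes back as unknown, which is the point where an adaptive method would decide whether to absorb it as a new normal or anomalous pattern. Once a pattern has been labeled, a similar window is mapped to the same label, which loosely mirrors the interpretability claim in the abstract.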