Anomaly Detection in a Large-scale Cloud Platform

Paper Title

Anomaly Detection in a Large-scale Cloud Platform

Authors

Mohammad Saiful Islam, William Pourmajidi, Lei Zhang, John Steinbacher, Tony Erwin, Andriy Miranskyy

Abstract

Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. To address the challenge, we designed and implemented an automated monitoring system for the IBM Cloud Platform. This monitoring system utilizes deep learning neural networks to detect anomalies in near-real-time in multiple Platform components simultaneously. After running the system for a year, we observed that the proposed solution frees the DevOps team's time and human resources from manually monitoring thousands of Cloud components. Moreover, it increases customer satisfaction by reducing the risk of Cloud outages. In this paper, we share our solution's architecture, implementation notes, and best practices that emerged while evolving the monitoring system. Other researchers and practitioners can leverage them to build anomaly detectors for complex systems.
