Automated Cause Analysis of Latency Outliers Using System-Level Dependency Graphs

Paper title

Automated Cause Analysis of Latency Outliers Using System-Level Dependency Graphs

Authors

Sneh Patel, Brendan Park, Naser Ezzati-Jivan, Quentin Fournier

Abstract

Detecting performance issues and identifying their root causes at runtime is a challenging task. Typically, developers use methods such as logging and tracing to identify bottlenecks. These solutions are not ideal, however, as they are time-consuming and require manual effort. In this paper, we propose a method that automates the detection of latency outliers using system-level traces and then compares them to identify the root cause(s). Our method uses dependency graphs to show the internal interactions between threads and system resources. With these graphs, one can pinpoint where performance issues occur. However, a single trace can be composed of a large number of requests, each generating one graph. To automate the identification of outliers within the dataset, we use density-based machine learning models and statistical calculations such as the z-score. Our evaluation shows an accuracy greater than 97% on outlier detection, making these techniques appropriate for in-production servers and industry-level use cases.
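
The abstract names the z-score as one of the statistical calculations used to flag latency outliers but gives no further detail. The following is a minimal sketch of that general idea, not the authors' implementation: the function name `zscore_outliers`, the threshold of 3.0, and the sample latencies are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(latencies_ms, threshold=3.0):
    """Return the indices of requests whose latency z-score exceeds `threshold`.

    Hypothetical helper: the threshold of 3.0 is a common rule of thumb,
    not a value taken from the paper.
    """
    mu = mean(latencies_ms)
    sigma = pstdev(latencies_ms)  # population standard deviation
    if sigma == 0:
        # All latencies are identical, so nothing can be an outlier.
        return []
    return [
        i for i, x in enumerate(latencies_ms)
        if abs(x - mu) / sigma > threshold
    ]


# Made-up per-request latencies in milliseconds; the last request is slow.
latencies_ms = [12.0, 11.5, 12.3, 11.9, 12.1, 12.4,
                11.8, 12.2, 12.0, 11.7, 12.6, 95.0]
outliers = zscore_outliers(latencies_ms)
print(outliers)
```

In a setting like the one the abstract describes, each flagged index would presumably correspond to one request whose dependency graph can then be compared against those of normal requests. A single extreme value inflates the mean and standard deviation, so a plain z-score can miss outliers in small samples; that is one reason to pair it with density-based models, as the abstract mentions.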
