Paper Title

Jarvis: Large-scale Server Monitoring with Adaptive Near-data Processing

Authors

Atul Sandur, ChanHo Park, Stavros Volos, Gul Agha, Myeongjae Jeon

Abstract

Rapid detection and mitigation of issues that impact performance and reliability is paramount for large-scale online services. For real-time detection of such issues, datacenter operators use a stream processor and analyze streams of monitoring data collected from servers (referred to as data source nodes) and their hosted services. The timely processing of incoming streams requires the network to transfer massive amounts of data, and significant compute resources to process it. These factors often create bottlenecks for stream analytics. To help overcome these bottlenecks, current monitoring systems employ near-data processing by either computing an optimal query partition based on a cost model or using model-agnostic heuristics. Optimal partitioning is computationally expensive, while model-agnostic heuristics are iterative and search over a large solution space. We combine these approaches by using model-agnostic heuristics to improve the partitioning solution from a model-based heuristic. Moreover, current systems use operator-level partitioning: if a data source does not have sufficient resources to execute an operator on all records, the operator is executed only on the stream processor. Instead, we perform data-level partitioning, i.e., we allow an operator to be executed both on a stream processor and data sources. We implement our algorithm in a system called Jarvis, which enables quick adaptation to dynamic resource conditions. Our evaluation on a diverse set of monitoring workloads suggests that Jarvis converges to a stable query partition within seconds of a change in node resource conditions. Compared to current partitioning strategies, Jarvis handles up to 75% more data sources while improving throughput in resource-constrained scenarios by 1.2-4.4x.
