Title


Towards QoS-Aware and Resource-Efficient GPU Microservices Based on Spatial Multitasking GPUs In Datacenters

Authors

Wei Zhang, Quan Chen, Kaihua Fu, Ningxin Zheng, Zhiyi Huang, Jingwen Leng, Chao Li, Wenli Zheng, Minyi Guo

Abstract


While prior research focuses on CPU-based microservices, it is not applicable to GPU-based microservices due to their different contention patterns. It is challenging to optimize resource utilization while guaranteeing the QoS of GPU microservices. We find that the overhead is caused by inter-microservice communication, GPU resource contention, and imbalanced throughput within the microservice pipeline. We propose Camelot, a runtime system that manages GPU microservices considering the above factors. In Camelot, a global memory-based communication mechanism enables onsite data sharing that significantly reduces the end-to-end latencies of user queries. We also propose two contention-aware resource allocation policies that either maximize the peak supported service load or minimize resource usage at low load while ensuring the required QoS. The two policies consider the microservice pipeline effect and the runtime GPU resource contention when allocating resources for the microservices. Compared with state-of-the-art work, Camelot increases the supported peak load by up to 64.5% with limited GPUs, and reduces resource usage by 35% at low load while achieving the desired 99%-ile latency target.
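To illustrate the pipeline effect the abstract refers to, the sketch below shows why an even GPU split across pipeline stages wastes resources: end-to-end throughput is bottlenecked by the slowest stage, so allocating shares inversely proportional to each stage's per-share throughput balances the stages. This is a minimal toy model, not Camelot's actual policy; the `rates` values and the linear throughput-vs-share assumption are hypothetical.

```python
def even_split(rates, budget=1.0):
    """Pipeline throughput when the GPU budget is split evenly.

    rates[i] is the (assumed) queries/s stage i achieves per unit of
    GPU share; pipeline throughput is limited by the slowest stage.
    """
    share = budget / len(rates)
    return min(r * share for r in rates)

def balanced_split(rates, budget=1.0):
    """Pipeline throughput when share_i is proportional to 1/rates[i].

    This equalizes the stages' throughputs, removing the bottleneck
    that caps the even split.
    """
    inv = [1.0 / r for r in rates]
    total = sum(inv)
    shares = [budget * w / total for w in inv]
    return min(r * s for r, s in zip(rates, shares))

# Hypothetical three-stage pipeline: per-share throughputs differ 4x.
rates = [200.0, 50.0, 100.0]
print(even_split(rates))      # limited by the 50 queries/s stage
print(balanced_split(rates))  # all stages run at the same rate
```

Under this model the balanced allocation supports roughly 28.6 queries/s versus about 16.7 for the even split, which is the intuition behind allocating more GPU share to low-throughput stages.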
