Paper Title

An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks

Authors

Pierrick Pochelu, Serge G. Petiton, Bruno Conche

Abstract

Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive. There is therefore a growing demand to make them answer a heavy workload of requests with the available computational resources. Unlike recent initiatives on inference servers and inference frameworks, which focus on serving single DNNs, we propose a new software layer to serve ensembles of DNNs flexibly and efficiently. Our inference system is designed with several technical innovations. First, we propose a novel procedure to find a good allocation matrix between devices (CPUs or GPUs) and DNN instances. It successively runs a worst-fit algorithm to allocate DNNs into device memory and a greedy algorithm to optimize the allocation settings and speed up the ensemble. Second, we design the inference system around multiple processes that run asynchronously: batching, prediction, and the combination rule, with an efficient internal communication scheme to avoid overhead. Experiments show flexibility and efficiency under extreme scenarios: the system succeeds in serving an ensemble of 12 heavy DNNs on 4 GPUs and, conversely, a single DNN multi-threaded across 16 GPUs. It also outperforms a simple baseline that only optimizes the DNNs' batch size, with a speedup of up to 2.7X on an image classification task.
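As a minimal illustration of the worst-fit placement step described in the abstract, the sketch below assigns each DNN instance to the device that currently has the most free memory. The function name, model names, and memory figures are hypothetical assumptions for illustration only; the paper's actual procedure additionally runs a greedy pass to tune the allocation settings and batching, which is not reproduced here.

```python
from typing import Dict

def worst_fit_allocate(dnn_mem: Dict[str, float],
                       device_mem: Dict[str, float]) -> Dict[str, str]:
    """Assign each DNN to the device with the most free memory (worst-fit).

    dnn_mem:    per-model memory footprints in GB (hypothetical figures)
    device_mem: per-device memory capacities in GB (hypothetical figures)
    Returns a mapping model -> device, or raises if a model does not fit anywhere.
    """
    free = dict(device_mem)                       # remaining memory per device
    allocation: Dict[str, str] = {}
    # Place the largest models first so they see the widest choice of devices.
    for name, mem in sorted(dnn_mem.items(), key=lambda kv: -kv[1]):
        device = max(free, key=free.get)          # device with the most free memory
        if free[device] < mem:
            raise MemoryError(f"{name} ({mem} GB) does not fit on any device")
        allocation[name] = device
        free[device] -= mem
    return allocation

if __name__ == "__main__":
    models = {"dnn_a": 6.0, "dnn_b": 5.5, "dnn_c": 4.0, "dnn_d": 3.5}
    devices = {"gpu0": 11.0, "gpu1": 11.0}
    print(worst_fit_allocate(models, devices))
    # e.g. {'dnn_a': 'gpu0', 'dnn_b': 'gpu1', 'dnn_c': 'gpu1', 'dnn_d': 'gpu0'}
```

Worst-fit keeps the remaining memory balanced across devices, which is why the ensemble members end up spread over the GPUs rather than packed onto the first one that fits.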
