Paper Title

An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks

Authors

Pierrick Pochelu, Serge G. Petiton, Bruno Conche

Abstract

Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive. There is therefore a growing demand to make them answer a heavy workload of requests with the available computational resources. Unlike recent initiatives on inference servers and inference frameworks, which focus on serving single DNNs, we propose a new software layer to serve ensembles of DNNs flexibly and efficiently. Our inference system is designed with several technical innovations. First, we propose a novel procedure to find a good allocation matrix between devices (CPUs or GPUs) and DNN instances. It successively runs a worst-fit algorithm to allocate DNNs into device memory and a greedy algorithm to optimize the allocation settings and speed up the ensemble. Second, we design the inference system around multiple processes that run asynchronously: batching, prediction, and the combination rule, with an efficient internal communication scheme to avoid overhead. Experiments show flexibility and efficiency under extreme scenarios: the system succeeds in serving an ensemble of 12 heavy DNNs on 4 GPUs and, conversely, a single DNN multi-threaded across 16 GPUs. It also outperforms a simple baseline that only optimizes the DNNs' batch size, with a speedup of up to 2.7X on an image classification task.
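As a minimal illustration of the worst-fit placement step described in the abstract, the sketch below assigns each DNN instance to the device that currently has the most free memory. The function name, model names, and memory figures are hypothetical assumptions for illustration only; the paper's actual procedure additionally runs a greedy pass to tune the allocation settings and batching, which is not reproduced here.

```python
from typing import Dict

def worst_fit_allocate(dnn_mem: Dict[str, float],
                       device_mem: Dict[str, float]) -> Dict[str, str]:
    """Assign each DNN to the device with the most free memory (worst-fit).

    dnn_mem:    per-model memory footprints in GB (hypothetical figures)
    device_mem: per-device memory capacities in GB (hypothetical figures)
    Returns a mapping model -> device, or raises if a model does not fit anywhere.
    """
    free = dict(device_mem)                       # remaining memory per device
    allocation: Dict[str, str] = {}
    # Place the largest models first so they see the widest choice of devices.
    for name, mem in sorted(dnn_mem.items(), key=lambda kv: -kv[1]):
        device = max(free, key=free.get)          # device with the most free memory
        if free[device] < mem:
            raise MemoryError(f"{name} ({mem} GB) does not fit on any device")
        allocation[name] = device
        free[device] -= mem
    return allocation

if __name__ == "__main__":
    models = {"dnn_a": 6.0, "dnn_b": 5.5, "dnn_c": 4.0, "dnn_d": 3.5}
    devices = {"gpu0": 11.0, "gpu1": 11.0}
    print(worst_fit_allocate(models, devices))
    # e.g. {'dnn_a': 'gpu0', 'dnn_b': 'gpu1', 'dnn_c': 'gpu1', 'dnn_d': 'gpu0'}
```

Worst-fit keeps the remaining memory balanced across devices, which is why the ensemble members end up spread over the GPUs rather than packed onto the first one that fits.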
