Paper Title


Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

Paper Authors

Benjamin Y. Cho, Jeageun Jung, Mattan Erez

Paper Abstract


DL inference queries play an important role in diverse internet services, and a large fraction of datacenter cycles are spent processing them. Specifically, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that the GEMM operations for datacenter DL inference tasks are memory-bandwidth bound, contrary to common assumptions: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large and colocated models require reading the large weight matrices from main memory, again requiring high bandwidth without offering reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in the main CPU memory. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU, which would otherwise destroy that locality. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels, along with optimizations that balance parallelism benefits against data-distribution overheads, demonstrates $12\times$ better minimum latency than a CPU and $2.8\times$ greater throughput under strict query latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU (by up to $16\times$) and prior main-memory acceleration approaches (by up to $2.4\times$ versus the best prior approach).
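The bandwidth-bound claim can be made concrete with a simple arithmetic-intensity estimate: a fully-connected layer with a $K \times N$ weight matrix and batch size $B$ performs about $2BKN$ FLOPs while streaming the $KN$ weights from main memory once, so intensity grows roughly linearly in $B$. The sketch below works through this reasoning in Python; the matrix size, fp32 element width, and the CPU compute/bandwidth figures are illustrative assumptions, not numbers taken from the paper.

```python
# Back-of-the-envelope arithmetic intensity of C[B,N] = A[B,K] @ W[K,N].
# Illustrates why small-batch inference GEMMs are memory-bandwidth bound;
# all hardware figures below are assumed for illustration.

def gemm_arithmetic_intensity(B, K, N, bytes_per_elem=4):
    """FLOPs per byte of DRAM traffic, assuming the K*N weight matrix is
    streamed from main memory once and activations are not cached."""
    flops = 2 * B * K * N                          # multiply-accumulates
    weight_bytes = K * N * bytes_per_elem          # weight matrix read
    act_bytes = (B * K + B * N) * bytes_per_elem   # inputs read + outputs written
    return flops / (weight_bytes + act_bytes)

# Assumed CPU balance point: ~2 TFLOP/s fp32 peak vs ~100 GB/s DRAM
# bandwidth => roughly 20 FLOPs/byte needed to become compute bound.
MACHINE_BALANCE = 2e12 / 100e9

for B in (1, 4, 16, 64, 256):
    ai = gemm_arithmetic_intensity(B, K=4096, N=4096)
    bound = "compute" if ai > MACHINE_BALANCE else "bandwidth"
    print(f"batch={B:4d}  intensity={ai:6.1f} FLOP/byte  -> {bound}-bound")
```

With these assumed figures the intensity works out to roughly $B/2$ FLOPs per byte, so only batches of several dozen or more cross the ~20 FLOP/byte balance point; latency-constrained small batches stay pinned to DRAM bandwidth, which is the regime the in-memory StepStone approach targets.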
