Paper Title

Distributed-Memory Sparse Kernels for Machine Learning

Authors

Vivek Bharadwaj, Aydin Buluç, James Demmel

Abstract

Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared-memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or for the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed-memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input/output data layouts. Further, we give two communication-eliding strategies to reduce the cost of FusedMM kernels: reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding approach save at least 30% of the time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques yield runtimes up to 1.6x faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating least squares and inference for attention-based graph neural networks.
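To make the kernel sequence concrete, the following is a minimal single-node sketch (not the paper's distributed implementation) of the FusedMM pattern: an SDDMM whose sparse output feeds an SpMM. It assumes the common convention that SDDMM scales the sampled products by the sparse matrix's existing values; the function names, shapes, and SciPy-based layout here are illustrative assumptions only.

```python
# Minimal local FusedMM sketch: SDDMM followed by SpMM.
# Illustrative only; the paper concerns distributed-memory 1.5D/2.5D variants.
import numpy as np
import scipy.sparse as sp

def sddmm(S, A, B):
    """Sampled dense-dense matmul: entries of A @ B.T evaluated only at the
    nonzero pattern of S, scaled by S's values (a common SDDMM convention)."""
    S = S.tocoo()
    # One dot product per stored nonzero of S.
    vals = np.einsum('ij,ij->i', A[S.row], B[S.col]) * S.data
    return sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape).tocsr()

def fusedmm(S, A, B):
    """FusedMM: S' = SDDMM(S, A, B), then Y = S' @ B (SpMM)."""
    S_out = sddmm(S, A, B)
    return S_out @ B  # sparse-times-dense product

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 1000, 800, 32                    # r = embedding width (assumed)
    S = sp.random(m, n, density=0.01, random_state=rng, format='csr')
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    Y = fusedmm(S, A, B)                       # dense m x r output
    print(Y.shape)
```

In the distributed setting described in the abstract, the opportunity for communication eliding comes from the fact that both kernels read the same dense matrix B and the same sparsity pattern, so replication of B can be reused across the SDDMM and the SpMM, or the two local kernels can be fused into one pass over the nonzeros.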
