Paper Title

Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Authors

Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

Abstract

Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multi-core cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and accelerator approaches, measuring a 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti GPU running a cuSPARSE kernel.
