Paper title
Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX
Paper authors
Abstract
The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format with suitable code optimizations can achieve bandwidth saturation for SpMV.
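To make the two storage formats named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and contains none of the A64FX-specific optimizations the paper refers to. All function and variable names (`spmv_crs`, `crs_to_sell`, `spmv_sell`, `perm`, `chunk_ptr`, and so on) are hypothetical and chosen for illustration. The sketch assumes the usual description of SELL-C-sigma: rows are sorted by length within windows of `sigma` rows, grouped into chunks of `C` rows, zero-padded to the longest row of each chunk, and stored column-major within the chunk.

```python
def spmv_crs(row_ptr, col_idx, val, x):
    """Compute y = A @ x with A stored in CRS (compressed row storage)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        # The inner loop length is the row length, which is often short.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += val[k] * x[col_idx[k]]
        y[i] = s
    return y


def crs_to_sell(row_ptr, col_idx, val, C, sigma):
    """Convert a CRS matrix to a SELL-C-sigma layout (illustrative only)."""
    n = len(row_ptr) - 1
    length = [row_ptr[i + 1] - row_ptr[i] for i in range(n)]

    # Sort rows by descending length inside each window of sigma rows.
    perm = []
    for start in range(0, n, sigma):
        window = list(range(start, min(start + sigma, n)))
        window.sort(key=lambda i: -length[i])
        perm.extend(window)

    chunk_ptr, chunk_len, s_col, s_val = [0], [], [], []
    for c0 in range(0, n, C):
        rows = perm[c0:c0 + C]
        width = max(length[i] for i in rows)
        chunk_len.append(width)
        # Column-major storage: entry j of all C rows is contiguous.
        for j in range(width):
            for r in range(C):
                if r < len(rows) and j < length[rows[r]]:
                    k = row_ptr[rows[r]] + j
                    s_col.append(col_idx[k])
                    s_val.append(val[k])
                else:
                    # Zero padding: contributes 0.0 * x[0] to the sum.
                    s_col.append(0)
                    s_val.append(0.0)
        chunk_ptr.append(len(s_val))
    return perm, chunk_ptr, chunk_len, s_col, s_val


def spmv_sell(perm, chunk_ptr, chunk_len, s_col, s_val, x, C):
    """Compute y = A @ x with A stored in the layout from crs_to_sell."""
    n = len(perm)
    y = [0.0] * n
    for c, width in enumerate(chunk_len):
        acc = [0.0] * C
        base = chunk_ptr[c]
        for j in range(width):
            # This loop over the C rows of a chunk is the one that
            # would map onto SIMD lanes in a vectorized kernel.
            for r in range(C):
                k = base + j * C + r
                acc[r] += s_val[k] * x[s_col[k]]
        for r in range(C):
            row = c * C + r
            if row < n:
                y[perm[row]] = acc[r]  # undo the row permutation
    return y
```

In this sketch the CRS kernel's innermost loop runs over a single row, whereas the SELL-C-sigma kernel's innermost loop runs over the `C` rows of a chunk with unit-stride access to `s_val` and `s_col`. Larger `sigma` values reduce padding by grouping rows of similar length, at the cost of permuting the rows.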