Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid

Paper Title

Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid

Authors

Huanqi Cao, Shizhi Tang, Qianchao Zhu, Bowen Yu, Wenguang Chen

Abstract

Partial differential equation (PDE) solvers are used extensively across scientific and engineering fields. However, achieving high performance and scalability often requires intricate, low-level programming, particularly when exploiting the deterministic sparsity patterns of structured grids. In this paper, we propose a domain-specific language (DSL), Mat2Stencil, together with its compiler, for PDE solvers on structured grids. Mat2Stencil introduces a structured sparse matrix abstraction that enables modular, flexible, and easy-to-use expression of a broad spectrum of solvers, including components such as Jacobi or Gauss-Seidel preconditioners, incomplete LU or Cholesky decompositions, and multigrid methods built upon them. Our DSL compiler then generates matrix-free code consisting of generalized stencils through multi-stage programming. The generated code allows spatial loop-carried dependences in the form of quasi-affine loops, in addition to Jacobi-style stencils, which are embarrassingly parallel along the spatial dimensions. We further propose a novel automatic parallelization technique for the spatially dependent loops, which provides compile-time deterministic task partitioning for threading, automatically calculates the necessary inter-thread synchronization, and generates an efficient multi-threaded implementation with fine-grained synchronization. We implement 4 benchmark programs: 3 are the pseudo-applications from the NAS Parallel Benchmarks, written with $6.3\%$ of the lines of code, and 1 is matrix-free High Performance Conjugate Gradients, written with $16.4\%$ of the lines of code. We achieve up to $1.67\times$ and on average $1.03\times$ the performance of manual implementations.
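To make the distinction between Jacobi-style stencils and spatially dependent loops concrete, here is a minimal sketch in plain Python. It is not Mat2Stencil syntax and not the compiler's output. The 1D Poisson problem, the function names `jacobi_sweep` and `gauss_seidel_sweep`, and the list-based grid are all assumptions chosen for illustration. Both sweeps are matrix-free: the tridiagonal matrix is never stored, only its stencil coefficients appear in the loop body.

```python
# Illustrative only: hypothetical helpers, not part of Mat2Stencil.
# Model problem (assumed): -u'' = f on a uniform 1D grid with spacing h,
# boundary values u[0] and u[-1] held fixed.

def jacobi_sweep(u, f, h):
    """One Jacobi sweep: every update reads only values from the old grid."""
    new = list(u)
    for i in range(1, len(u) - 1):
        # No iteration depends on another, so this loop is
        # embarrassingly parallel along the spatial dimension.
        new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return new


def gauss_seidel_sweep(u, f, h):
    """One Gauss-Seidel sweep: each update reuses the value just computed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        # new[i - 1] was written earlier in this same sweep: a spatial
        # loop-carried dependence that forbids naive parallelization.
        new[i] = 0.5 * (new[i - 1] + new[i + 1] + h * h * f[i])
    return new


u0 = [1.0, 0.0, 0.0, 0.0, 0.0]
f0 = [0.0] * 5
jacobi_result = jacobi_sweep(u0, f0, 1.0)
gauss_seidel_result = gauss_seidel_sweep(u0, f0, 1.0)
```

With this input, the Jacobi sweep moves the boundary value only one cell inward, while the Gauss-Seidel sweep carries it across the whole grid in a single pass. That ordering constraint is the kind of dependence the abstract's automatic parallelization technique targets.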
