Paper title
Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid
Paper authors
Paper abstract
Partial differential equation (PDE) solvers are used extensively across scientific and engineering fields. However, achieving high performance and scalability often requires intricate, low-level programming, particularly when exploiting the deterministic sparsity patterns of structured grids. In this paper, we propose Mat2Stencil, an innovative domain-specific language (DSL) with its compiler, for PDE solvers on structured grids. Mat2Stencil introduces a structured sparse matrix abstraction that enables modular, flexible, and easy-to-use expression of a broad spectrum of solvers, encompassing components such as Jacobi or Gauss-Seidel preconditioners, incomplete LU or Cholesky decompositions, and multigrid methods built upon them. Our DSL compiler then generates matrix-free code consisting of generalized stencils through multi-stage programming. Beyond Jacobi-style stencils, which are embarrassingly parallel across the spatial dimensions, the generated code allows spatial loop-carried dependence in the form of quasi-affine loops. We further propose a novel automatic parallelization technique for these spatially dependent loops: it provides compile-time deterministic task partitioning for threading, automatically calculates the necessary inter-thread synchronization, and generates an efficient multi-threaded implementation with fine-grained synchronization. We implement 4 benchmark programs: 3 are pseudo-applications from the NAS Parallel Benchmarks, written with $6.3\%$ of the lines of code, and 1 is a matrix-free High Performance Conjugate Gradients, written with $16.4\%$ of the lines of code. Compared to manual implementations, we achieve up to $1.67\times$ and on average $1.03\times$ the performance.
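The distinction the abstract draws between Jacobi-style stencils and loops with spatial loop-carried dependence can be made concrete with a small hand-written sketch. The Python below is not Mat2Stencil syntax and not its compiler output; the function names, the 5-point Laplacian, and the grid layout are all assumptions chosen for illustration. The third function uses the classic anti-diagonal (wavefront) ordering to show that a dependent sweep still contains parallelism. It stands in for, and should not be read as, the compile-time task partitioning with fine-grained synchronization that the paper proposes.

```python
# Illustrative sketch only: hand-written matrix-free sweeps for a 5-point Laplacian.
# Nothing here is Mat2Stencil syntax or compiler output; all names are hypothetical.


def _relax(u, f, h, i, j):
    """Point update for the 5-point discretization of -laplace(u) = f."""
    return 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   + h * h * f[i][j])


def jacobi_sweep(u, f, h):
    """One Jacobi sweep: every update reads only the old grid.

    No iteration depends on another, so all interior points could be
    updated in parallel (embarrassingly parallel in space).
    """
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = _relax(u, f, h, i, j)
    return new


def gauss_seidel_sweep(u, f, h):
    """One lexicographic Gauss-Seidel sweep, in place.

    u[i-1][j] and u[i][j-1] were already overwritten earlier in this sweep,
    so the loop nest carries a dependence along both spatial dimensions.
    """
    n, m = len(u), len(u[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u[i][j] = _relax(u, f, h, i, j)
    return u


def gauss_seidel_wavefront(u, f, h):
    """The same sweep, visiting interior points by anti-diagonal i + j = d.

    Points on one anti-diagonal do not depend on each other, so each
    wavefront could be processed in parallel. Every point still sees updated
    values from diagonal d - 1 and old values from diagonal d + 1, so the
    result matches the lexicographic sweep.
    """
    n, m = len(u), len(u[0])
    for d in range(2, n + m - 3):
        lo = max(1, d - (m - 2))
        hi = min(n - 2, d - 1)
        for i in range(lo, hi + 1):
            j = d - i
            u[i][j] = _relax(u, f, h, i, j)
    return u
```

In this sketch the matrix is never stored: each sweep applies the operator's coefficients directly at every grid point, which is the sense in which such code is matrix-free.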