Paper Title

Parthenon -- a performance portable block-structured adaptive mesh refinement framework

Paper Authors

Philipp Grete, Joshua C. Dolence, Jonah M. Miller, Joshua Brown, Ben Ryan, Andrew Gaspar, Forrest Glines, Sriram Swaminarayan, Jonas Lippuner, Clell J. Solomon, Galen Shipman, Christoph Junghans, Daniel Holladay, James M. Stone, Luke F. Roberts

Paper Abstract

On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lags behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction: from multi-dimensional variables, to packages defining and separating components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of $1.7\times10^{13}$ zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at ~92% weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.
