Rapid Exploration of Optimization Strategies on Advanced Architectures using TestSNAP and LAMMPS

Paper title

Rapid Exploration of Optimization Strategies on Advanced Architectures using TestSNAP and LAMMPS

Authors

Rahulkumar Gayatri, Stan Moore, Evan Weinberg, Nicholas Lubbers, Sarah Anderson, Jack Deslippe, Danny Perez, Aidan P. Thompson

Abstract

The exascale race is at an end with the announcement of the Aurora and Frontier machines. This next generation of supercomputers utilizes diverse hardware architectures to achieve its compute performance, placing an added onus on the performance portability of applications. An expanding fragmentation of programming models would pose a compounding optimization challenge were it not for the evolution of performance-portable frameworks, which provide unified models for mapping abstract hierarchies of parallelism to diverse architectures. Kokkos is one such performance-portable programming model for C++ applications, providing back-end implementations for each major HPC platform. Even with a performance-portable framework, restructuring algorithms to expose higher degrees of parallelism is non-trivial. The Spectral Neighbor Analysis Potential (SNAP) is a machine-learned interatomic potential used in cutting-edge molecular dynamics simulations. Previous implementations of the SNAP calculation showed a downward trend in performance relative to peak on newer-generation CPUs, and low performance on GPUs. In this paper we describe the restructuring and optimization of SNAP as implemented in the Kokkos CUDA backend of the LAMMPS molecular dynamics package, benchmarked on NVIDIA GPUs. We identify novel patterns of hierarchical parallelism that minimize memory access overheads and push the implementation into a compute-saturated regime. Our implementation via Kokkos enables recompile-and-run efficiency on upcoming architectures. We find a $\sim$22x time-to-solution improvement relative to an existing implementation, as measured on an NVIDIA Tesla V100-16GB for an important benchmark.
