Paper Title

A GPU-Accelerated Fast Summation Method Based on Barycentric Lagrange Interpolation and Dual Tree Traversal

Authors

Wilson, Leighton, Vaughn, Nathan, Krasny, Robert

Abstract

We present the barycentric Lagrange dual tree traversal (BLDTT) fast summation method for particle interactions. The scheme replaces well-separated particle-particle interactions by adaptively chosen particle-cluster, cluster-particle, and cluster-cluster approximations given by barycentric Lagrange interpolation at proxy particles on a Chebyshev grid in each cluster. The BLDTT is kernel-independent and the approximations can be efficiently mapped onto GPUs, where target particles provide an outer level of parallelism and source particles provide an inner level of parallelism. We present an OpenACC GPU implementation of the BLDTT with MPI remote memory access for distributed memory parallelization. The performance of the GPU-accelerated BLDTT is demonstrated for calculations with different problem sizes, particle distributions, geometric domains, and interaction kernels, as well as for unequal target and source particles. Comparison with our earlier particle-cluster barycentric Lagrange treecode (BLTC) demonstrates the superior performance of the BLDTT. In particular, on a single GPU for problem sizes ranging from $N$=1E5 to 1E8, the BLTC has $O(N\log N)$ scaling, while the BLDTT has $O(N)$ scaling. In addition, MPI strong scaling results are presented for the BLTC and BLDTT using $N$=64E6 particles on up to 32 GPUs.
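The cluster approximations in the abstract rest on barycentric Lagrange interpolation at Chebyshev points. As a point of reference only (this is not the authors' BLDTT code, and the tensor-product 3D form, kernel evaluation, and tree traversal are all omitted), a minimal 1D sketch of the second barycentric formula at Chebyshev points of the second kind looks like:

```python
import numpy as np

def chebyshev_points(n):
    # Chebyshev points of the second kind on [-1, 1]: x_j = cos(j*pi/n)
    return np.cos(np.pi * np.arange(n + 1) / n)

def barycentric_weights(n):
    # Closed-form weights for these points: w_j = (-1)^j, halved at the endpoints
    w = (-1.0) ** np.arange(n + 1)
    w[0] *= 0.5
    w[-1] *= 0.5
    return w

def barycentric_interpolate(x_nodes, f_nodes, w, x):
    # Second (true) barycentric formula; returns the node value exactly at a node
    diff = x - x_nodes
    exact = np.isclose(diff, 0.0)
    if exact.any():
        return f_nodes[np.argmax(exact)]
    terms = w / diff
    return np.sum(terms * f_nodes) / np.sum(terms)

n = 16
xk = chebyshev_points(n)
w = barycentric_weights(n)
f = np.exp(xk)  # sample function values at the proxy (Chebyshev) points
approx = barycentric_interpolate(xk, f, w, 0.3)
```

Because the weights are known in closed form, no divided differences are needed, and the formula is evaluated with one pass over the proxy points; this is what makes the scheme kernel-independent and GPU-friendly.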
