Title
Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs
Authors
Abstract
Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time often does not yield a satisfactory prediction of parallel runtime, because of deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks, we observe that although desynchronization can increase the waiting time per process, it does not necessarily lower resource utilization; instead, it can increase the available bandwidth per core. In the case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and in the emerging stable wave pattern. We also demonstrate how hybrid MPI+OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting.
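The interplay between the saturation point and desynchronization can be illustrated with a deliberately simple bandwidth-contention model. The Python sketch below is a hypothetical toy model, not the authors' model or code. It assumes that every process streams the same data volume per iteration, that a single process can draw at most `b_core`, that the whole memory domain is capped at `b_sat`, and that saturated bandwidth is shared evenly among the processes streaming at the same time. All function names and numbers are illustrative.

```python
import math

# Toy model of memory-bandwidth contention inside ONE memory domain.
# Hypothetical assumptions (not taken from the paper): all processes stream the
# same volume per iteration, one process alone draws at most b_core, the domain
# is capped at b_sat, and saturated bandwidth is shared evenly among streamers.


def saturation_point(b_core: float, b_sat: float) -> int:
    """Smallest number of concurrently streaming processes that saturates b_sat."""
    return math.ceil(b_sat / b_core)


def bw_per_process(active: int, b_core: float, b_sat: float) -> float:
    """Bandwidth seen by each of `active` processes streaming at the same time."""
    if active < 1:
        raise ValueError("active must be >= 1")
    return min(b_core, b_sat / active)


def lockstep_time(n: int, volume: float, t_comm: float,
                  b_core: float, b_sat: float) -> float:
    """Iteration time if all n processes stream together, then all communicate."""
    return volume / bw_per_process(n, b_core, b_sat) + t_comm


def desync_time_bound(n: int, volume: float, t_comm: float,
                      b_core: float, b_sat: float) -> float:
    """Lower bound on iteration time for a perfectly desynchronized steady state.

    The domain moves at most b_sat bytes/s in total, and each process still
    needs volume/b_core of streaming plus its own (non-overlapped) t_comm.
    """
    return max(n * volume / b_sat, volume / b_core + t_comm)


if __name__ == "__main__":
    B_CORE, B_SAT = 10e9, 40e9     # bytes/s, hypothetical values
    N, V, T_COMM = 10, 1e8, 0.01   # processes, bytes per iteration, seconds

    print("saturation point:", saturation_point(B_CORE, B_SAT))
    print("per-process bw, 10 streaming:", bw_per_process(N, B_CORE, B_SAT))
    print("per-process bw, 5 streaming:", bw_per_process(5, B_CORE, B_SAT))
    print("lockstep time:", lockstep_time(N, V, T_COMM, B_CORE, B_SAT))
    print("desync lower bound:", desync_time_bound(N, V, T_COMM, B_CORE, B_SAT))
```

With these hypothetical numbers, the saturation point is 4. With all 10 processes streaming, each gets 4 GB/s, while with only five streaming (the other five communicating), each gets 8 GB/s. The lockstep iteration takes 0.035 s, whereas the desynchronized lower bound is 0.025 s, so under these assumptions the 10 ms of communication could in principle be hidden entirely behind other processes' memory traffic. Modeling a hybrid MPI+OpenMP run as a single process per domain whose threads already reach `b_sat` (n = 1, volume 1e9 bytes, `b_core = b_sat`) makes both functions return 0.035 s, which mirrors, in this toy setting, the point that removing the bandwidth bottleneck among processes also removes the benefit of desynchronization.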