Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Title

Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Authors

Hunold, Sascha; von Kirchbach, Konrad; Lehr, Markus; Schulz, Christian; Träff, Jesper Larsson

Abstract

Good process-to-compute-node mappings can be decisive for well-performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern to Cartesian grids. By thoroughly exploiting the inherently present structure in this type of problem, we devise three novel distributed algorithms that are able to handle arbitrary stencil communication patterns effectively. We analyze the expected performance of our algorithms based on an abstract model of inter- and intra-node communication. An extensive experimental evaluation on several HPC machines shows that our algorithms are up to two orders of magnitude faster in running time than a (sequential) high-quality general graph mapping tool, while obtaining similar results in communication performance. Furthermore, our algorithms also achieve significantly better mapping quality compared to previous state-of-the-art Cartesian grid mapping algorithms. This results in up to a threefold performance improvement of an MPI_Neighbor_alltoall exchange operation. Our new algorithms can be used to implement the MPI_Cart_create functionality.
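To make the optimization target concrete, here is a minimal Python sketch that counts how many stencil messages cross node boundaries for a given process-to-node mapping. It is not one of the paper's three algorithms. It assumes a non-periodic 2D grid, a 5-point stencil, and nodes that each host the same number of processes. The helper names (`row_major_mapping`, `block_mapping`, `inter_node_edges`) are hypothetical, and the intra-node communication cost that the paper's model also considers is ignored here.

```python
"""Illustrative sketch: count inter-node stencil messages for a process-to-node mapping."""
from itertools import product

# 5-point stencil: each process talks to its four axis-aligned neighbors.
FIVE_POINT = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def stencil_neighbors(coord, dims, offsets):
    """Yield the in-grid neighbors of `coord` for the given offsets (no wrap-around)."""
    for off in offsets:
        nb = tuple(c + o for c, o in zip(coord, off))
        if all(0 <= x < d for x, d in zip(nb, dims)):
            yield nb


def row_major_mapping(dims, procs_per_node):
    """Hypothetical naive mapping: consecutive row-major ranks share a node."""
    mapping = {}
    for rank, coord in enumerate(product(*(range(d) for d in dims))):
        mapping[coord] = rank // procs_per_node
    return mapping


def block_mapping(dims, block):
    """Hypothetical tiled mapping: each rectangular block of the grid goes to one node."""
    assert all(d % b == 0 for d, b in zip(dims, block)), "block must divide dims"
    blocks_per_dim = [d // b for d, b in zip(dims, block)]
    mapping = {}
    for coord in product(*(range(d) for d in dims)):
        node = 0
        # Linearize the block coordinate in row-major order to get a node id.
        for c, b, nblocks in zip(coord, block, blocks_per_dim):
            node = node * nblocks + c // b
        mapping[coord] = node
    return mapping


def inter_node_edges(mapping, dims, offsets):
    """Count directed stencil messages whose sender and receiver are on different nodes."""
    return sum(
        1
        for coord in mapping
        for nb in stencil_neighbors(coord, dims, offsets)
        if mapping[coord] != mapping[nb]
    )


if __name__ == "__main__":
    dims = (8, 8)  # 64 processes on 4 nodes with 16 processes each
    print(inter_node_edges(row_major_mapping(dims, 16), dims, FIVE_POINT))
    print(inter_node_edges(block_mapping(dims, (4, 4)), dims, FIVE_POINT))
```

On the 8x8 grid, the row-major mapping puts two full grid rows on each node and yields 48 inter-node messages, while the 4x4 tiling yields 32. The tiling wins because a compact block's boundary grows more slowly than its area, so fewer stencil neighbors end up on another node.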
