Paper Title
Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement
Paper Authors
Paper Abstract
HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large-scale systems face two challenges that hinder further growth: energy efficiency and resilience. At the same time, applications seeking higher performance rely on a high degree of parallelism to exploit system resources, which puts increasing pressure on system interconnects. At large system scales, increased communication locality can be beneficial both for application performance and for energy consumption. In this direction, several studies focus on mapping an application's processes to system nodes so as to reduce communication cost. A common approach is to express both the application's communication pattern and the system architecture as graphs and then solve the corresponding graph-mapping problem. Apart from communication cost, the completion time of a job can also be affected by node failures, which may abort the job and force it to restart. In this paper, we address the problem of assigning processes to system resources with the goal of reducing communication cost while also taking node failures into account. The proposed approach is integrated into the Slurm resource manager. Evaluation results show that, in scenarios where a few nodes have a low outage probability, the proposed process placement approach achieves a notable decrease in the completion time of batches of MPI jobs. Compared to the default process placement in Slurm, the reduction is 18.9% and 31%, respectively, for two different MPI applications.
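To make the graph-mapping idea concrete, below is a minimal, illustrative sketch of a greedy topology- and fault-aware placement heuristic. It is not the paper's actual algorithm or its Slurm integration: the cost model, the single-process-per-node assumption, and all names (comm, dist, p_fail, alpha) are assumptions introduced here for illustration.

```python
def greedy_placement(comm, dist, p_fail, alpha=1.0):
    """Toy greedy mapping of processes to nodes (illustrative only).

    comm[i][j] -- communication volume between processes i and j (symmetric)
    dist[a][b] -- hop distance between nodes a and b in the interconnect
    p_fail[a]  -- estimated outage probability of node a
    alpha      -- weight trading off failure risk against communication cost
    """
    n_procs = len(comm)
    placement = {}                      # process -> node
    free_nodes = set(range(len(dist)))  # one process per node in this toy model

    # Place processes in order of total communication volume, heaviest first,
    # so the most communication-intensive processes get the best slots.
    order = sorted(range(n_procs), key=lambda i: -sum(comm[i]))
    for p in order:
        def cost(node):
            # Communication cost to already-placed neighbors ...
            c = sum(comm[p][q] * dist[node][placement[q]] for q in placement)
            # ... plus a penalty for placing work on failure-prone nodes.
            return c + alpha * p_fail[node]
        best = min(free_nodes, key=cost)
        placement[p] = best
        free_nodes.discard(best)
    return placement


# Example: 3 processes, 4 nodes on a line; node 1 is failure-prone.
comm = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]
dist = [[0, 1, 2, 2], [1, 0, 1, 2], [2, 1, 0, 1], [2, 2, 1, 0]]
p_fail = [0.01, 0.30, 0.01, 0.02]
print(greedy_placement(comm, dist, p_fail, alpha=10.0))
```

With a larger alpha, the heuristic steers heavily communicating processes away from the failure-prone node even at some cost in hop distance, mirroring the trade-off between communication cost and resilience that the abstract describes.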