Scalable Communication Endpoints for MPI+Threads Applications

Paper title

Scalable Communication Endpoints for MPI+Threads Applications

Authors

Rohit Zambre, Aparna Chandramowlishwaran, Pavan Balaji

Abstract

Hybrid MPI+threads programming is gaining prominence as an alternative to the traditional "MPI everywhere" model to better handle the disproportionate increase in the number of cores compared with other on-node resources. Current implementations of these two models represent the two extreme cases of communication resource sharing in modern MPI implementations. In the MPI-everywhere model, each MPI process has a dedicated set of communication resources (also known as endpoints), which is ideal for performance but wastes resources. With MPI+threads, current MPI implementations share a single communication endpoint among all threads, which is ideal for resource usage but hurts performance. In this paper, we explore the tradeoff space between performance and communication resource usage in MPI+threads environments. We first demonstrate the two extreme cases, one where all threads share a single communication endpoint and another where each thread gets its own dedicated communication endpoint (similar to the MPI-everywhere model), and showcase the inefficiencies in both cases. Next, we perform a thorough analysis of the different levels of resource sharing in the context of Mellanox InfiniBand. Using the lessons learned from this analysis, we design an improved resource-sharing model to produce scalable communication endpoints that can achieve the same performance as dedicated communication resources per thread while using just a third of the resources.
