Title
$k$-ported vs. $k$-lane Broadcast, Scatter, and Alltoall Algorithms
Authors
Abstract
In $k$-ported message-passing systems, a processor can simultaneously receive $k$ different messages from $k$ other processors and send $k$ different messages to $k$ other processors, which may or may not be the same processors from which messages are received. Modern clustered systems may not have such capabilities. Instead, compute nodes consisting of $n$ processors can simultaneously send and receive $k$ messages to and from other nodes by letting $k$ of the processors on a node each concurrently send and receive at most one message. We pose the question of how to design good algorithms for this $k$-lane model, possibly by adapting algorithms devised for the traditional $k$-ported model. We discuss and compare a number of (non-optimal) $k$-lane algorithms for the broadcast, scatter, and alltoall collective operations (as found in, e.g., MPI), and experimentally evaluate them on a small $36\times 32$-node cluster with a dual OmniPath network (corresponding to $k=2$). Results are preliminary.
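To make the difference between the two models concrete, the following minimal sketch counts communication rounds for broadcast. It assumes that in a $k$-ported broadcast every processor already holding the message forwards it to $k$ new processors per round, so the informed set grows by a factor of $k+1$ and $\lceil \log_{k+1} p \rceil$ rounds are needed. The function `klane_broadcast_rounds` is a hypothetical two-phase scheme and not one of the algorithms from the paper. In the first phase, nodes act as $k$-ported entities, one message per lane. In the second phase, each node runs a single-ported broadcast among its $n$ processors. The sketch is idealized: it ignores the intra-node copying needed before $k$ processors on a node can all use their lanes, and it ignores message sizes. The cluster shape reads $36\times 32$ as 36 nodes with 32 processors each.

```python
def broadcast_rounds(p: int, k: int) -> int:
    """Rounds for a k-ported broadcast among p processors, assuming each
    informed processor sends to k uninformed processors per round."""
    if p < 1 or k < 1:
        raise ValueError("need p >= 1 and k >= 1")
    informed, rounds = 1, 0
    while informed < p:
        # Every informed processor reaches k new ones: growth factor k + 1.
        informed = min(p, informed * (k + 1))
        rounds += 1
    return rounds  # equals ceil(log_{k+1}(p))


def klane_broadcast_rounds(num_nodes: int, n: int, k: int) -> int:
    """Hypothetical, idealized two-phase k-lane broadcast:
    phase 1 treats each node as a k-ported entity (k lanes, one message
    per lane per round); phase 2 broadcasts single-ported within a node.
    Intra-node copying needed to feed the k lanes is ignored."""
    inter_node = broadcast_rounds(num_nodes, k)
    intra_node = broadcast_rounds(n, 1)
    return inter_node + intra_node


if __name__ == "__main__":
    # 36 nodes x 32 processors, k = 2 (dual-rail network).
    print(broadcast_rounds(36 * 32, 1))       # 11 (single-ported)
    print(broadcast_rounds(36 * 32, 2))       # 7 (2-ported, every processor)
    print(klane_broadcast_rounds(36, 32, 2))  # 9 (idealized 2-lane scheme)
```

Under these assumptions, the fully 2-ported model needs 7 rounds for 1152 processors. The naive 2-lane adaptation needs 9, because only the node-level phase benefits from the extra lane. This illustrates why algorithms designed for the $k$-ported model do not transfer directly to the $k$-lane setting.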