Paper Title


Accelerating Distributed MoE Training and Inference with Lina

Paper Authors

Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, Hong Xu

Paper Abstract


Scaling model parameters improves model quality at the price of high computation overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) architecture, have sub-linear scaling of computation cost with model size, thus providing opportunities to train and serve a larger model at lower cost than their dense counterparts. However, distributed MoE training and inference is inefficient, mainly due to the interleaved all-to-all communication during model computation. This paper makes two main contributions. First, we systematically analyze all-to-all overhead in distributed MoE and present the main causes for it to be the bottleneck in training and inference, respectively. Second, we design and build Lina to address the all-to-all bottleneck head-on. Lina opportunistically prioritizes all-to-all over the concurrent allreduce whenever feasible using tensor partitioning, so all-to-all and training step time is improved. Lina further exploits the inherent pattern of expert selection to dynamically schedule resources during inference, so that the transfer size and bandwidth of all-to-all across devices are balanced amid the highly skewed expert popularity in practice. Experiments on an A100 GPU testbed show that Lina reduces the training step time by up to 1.73x and reduces the 95%ile inference time by an average of 1.63x over the state-of-the-art systems.
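To make the training-side idea concrete, below is a minimal sketch in PyTorch of the general technique the abstract describes: partitioning a large gradient allreduce into small pieces so that a pending MoE all-to-all can be issued at the next partition boundary instead of waiting behind one monolithic allreduce. It assumes a torch.distributed process group is already initialized; the `a2a_requests` queue, the `chunked_allreduce` function, and the partition count of 8 are illustrative assumptions for this sketch, not Lina's actual interface or implementation.

```python
import queue

import torch
import torch.distributed as dist

# Pending MoE all-to-all requests from the expert layer would be enqueued
# here as (input, output) tensor pairs; this queue is hypothetical plumbing
# for the sketch, not part of Lina or of PyTorch.
a2a_requests: queue.Queue = queue.Queue()


def chunked_allreduce(grad: torch.Tensor, num_chunks: int = 8) -> None:
    """Allreduce `grad` in small partitions; at each partition boundary,
    drain any pending all-to-all first so it never waits behind one
    monolithic allreduce."""
    for piece in grad.view(-1).chunk(num_chunks):
        # Prioritize the latency-critical MoE all-to-all over the
        # remaining allreduce partitions.
        while not a2a_requests.empty():
            a2a_in, a2a_out = a2a_requests.get_nowait()
            dist.all_to_all_single(a2a_out, a2a_in)
        # Each small allreduce occupies the link only briefly, so a newly
        # arrived all-to-all is delayed by at most one partition.
        dist.all_reduce(piece, op=dist.ReduceOp.SUM)
```

Finer partitions bound how long an all-to-all can be blocked, at the cost of launching more collectives; the paper's tensor-partitioning design targets exactly this trade-off.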
