Paper Title
Highly Available Data Parallel ML training on Mesh Networks
Paper Authors
Paper Abstract
Data parallel ML models can take several days or weeks to train across many accelerators. Such long training runs require the cluster of resources to remain available for the entire duration of the job. On a mesh network this is challenging because failures create holes in the mesh, and packets must be routed around the failed chips to preserve full connectivity. In this paper, we present techniques to route gradient summation (allreduce) traffic around failed chips on 2-D meshes. We evaluate the performance of our fault-tolerant allreduce techniques on the MLPerf-v0.7 ResNet-50 and BERT benchmarks. The results show minimal impact on training throughput for 512 and 1024 TPU-v3 chips.
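To make the core idea concrete, the toy Python sketch below (our illustration, not the authors' implementation; the function name ring_allreduce_around_failures and its interface are hypothetical) simulates gradient summation on a 1-D logical ring in which a failed chip is simply skipped, so the summation traffic is routed around the hole it leaves. The paper's actual techniques operate on 2-D meshes of TPU-v3 chips and route packets in the physical interconnect.

```python
import numpy as np

def ring_allreduce_around_failures(gradients, failed):
    """Sum gradients over the healthy chips of a logical ring.

    gradients: list of np.ndarray, one per chip (entries for failed chips are ignored).
    failed:    set of chip indices that have dropped out of the ring.
    Returns a dict mapping each healthy chip index to the summed gradient.
    """
    # Build the ring over healthy chips only: each chip's downstream neighbour
    # is the next healthy chip, so data hops around any hole left by a failure.
    healthy = [i for i in range(len(gradients)) if i not in failed]
    # Reduce phase: circulate a running partial sum once around the ring.
    partial = gradients[healthy[0]].copy()
    for chip in healthy[1:]:
        partial = partial + gradients[chip]   # each healthy chip adds its local gradient
    # Broadcast phase: circulate the final sum so every healthy chip
    # ends up with the same reduced gradient.
    return {chip: partial.copy() for chip in healthy}

if __name__ == "__main__":
    grads = [np.full(4, float(i + 1)) for i in range(8)]   # chip i holds gradient value i+1
    reduced = ring_allreduce_around_failures(grads, failed={3})
    print(reduced[0])   # sum over chips {0,1,2,4,5,6,7} -> [32. 32. 32. 32.]
```

The sketch only captures the routing intuition: removing a failed chip from the logical reduction order leaves the remaining chips fully connected, at the cost of a slightly longer path, which is consistent with the small throughput impact reported in the abstract.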