Paper Title

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

Authors

Ziyue Luo, Yixin Bao, Chuan Wu

Abstract

Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the large graph data cannot fit into the memory of a single machine and must instead be fetched from distributed graph storage and processed on the go. Unlike distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in the transmission of large volumes of graph data for constructing mini-batches of training samples. Existing solutions often advocate data-computation colocation, and do not work well under limited resources where colocation is infeasible. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithmic framework for task placement and execution scheduling in distributed GNN training, to better utilize resources, improve execution pipelining, and expedite training completion. Our framework consists of two modules: (i) an online scheduling algorithm that schedules the execution of training tasks and the data transmission plan; and (ii) an exploratory task placement scheme that decides the placement of each training task. We conduct thorough theoretical analysis, testbed experiments, and simulation studies, and observe up to 67% training speed-up with our algorithm compared to representative baselines.
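The abstract does not give the paper's actual algorithms, but the interplay of its two modules can be illustrated with a toy model. The sketch below is a hypothetical simplification under assumed conventions: a single shared network link serving one graph-data transfer at a time, per-task data sizes and compute times given up front, and random sampling standing in for the paper's exploratory placement scheme. All names (`schedule_online`, `explore_placement`, the toy data) are illustrative, not from the paper.

```python
import random

def transfer_time(task, server, data_size, bandwidth):
    """Time to fetch the task's graph data that is NOT stored on `server`."""
    remote = sum(sz for src, sz in data_size[task].items() if src != server)
    return remote / bandwidth

def schedule_online(tasks, placement, data_size, compute_time, bandwidth):
    """Greedy online schedule: each task first fetches its remote graph data
    over a shared network link, then computes on its placed server; transfers
    and compute on different servers overlap (pipelining)."""
    net_free = 0.0              # when the shared network link is next free
    cpu_free = {}               # per-server time at which compute becomes free
    makespan = 0.0
    for task in tasks:          # tasks handled in arrival order
        server = placement[task]
        xfer_end = net_free + transfer_time(task, server, data_size, bandwidth)
        net_free = xfer_end     # one transfer occupies the link at a time
        start = max(xfer_end, cpu_free.get(server, 0.0))
        end = start + compute_time[task]
        cpu_free[server] = end
        makespan = max(makespan, end)
    return makespan

def explore_placement(tasks, servers, data_size, compute_time, bandwidth,
                      trials=200, seed=0):
    """Exploratory placement: sample candidate placements and keep the one
    whose online schedule finishes earliest."""
    rng = random.Random(seed)
    best, best_makespan = None, float("inf")
    for _ in range(trials):
        placement = {t: rng.choice(servers) for t in tasks}
        m = schedule_online(tasks, placement, data_size, compute_time, bandwidth)
        if m < best_makespan:
            best, best_makespan = placement, m
    return best, best_makespan

# Toy example: each task's mini-batch data lives on one of two servers.
data_size = {"t0": {"s0": 10.0}, "t1": {"s1": 10.0}}   # data-size units
compute_time = {"t0": 1.0, "t1": 1.0}                  # time units
best, makespan = explore_placement(["t0", "t1"], ["s0", "s1"],
                                   data_size, compute_time, bandwidth=1.0)
print(best, makespan)   # the colocated placement avoids all transfers
```

Even in this toy setting, the exploration finds the data-computation colocated placement, while a mismatched placement pays two 10-unit transfers, which is exactly the regime the paper targets: when colocation is infeasible, placement and transmission scheduling jointly determine the training makespan.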
