Paper Title
Elastic Model Aggregation with Parameter Service
Paper Authors
Paper Abstract
Model aggregation, the process that updates model parameters, is an important step for model convergence in distributed deep learning (DDL). However, the parameter server (PS), a popular paradigm for performing model aggregation, causes CPU underutilization in deep learning (DL) clusters, due to the bursty nature of aggregation and static resource allocation. To remedy this problem, we propose Parameter Service, an elastic model aggregation framework for DDL training, which decouples the function of model aggregation from individual training jobs and provides a shared model aggregation service to all jobs in the cluster. In Parameter Service, model aggregations are efficiently packed and dynamically migrated to fit into the available CPUs with negligible time overhead. Furthermore, Parameter Service elastically manages its CPU resources based on its load to enhance resource efficiency. We have implemented Parameter Service in a prototype system called AutoPS and evaluated it via testbed experimentation and trace-driven simulations. AutoPS reduces CPU consumption by up to 75% with little or no performance impact on the training jobs. The design of Parameter Service is transparent to users and can be incorporated into popular DL frameworks.
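To make the aggregation step concrete, below is a minimal, illustrative sketch of parameter-server-style model aggregation in plain NumPy. The class and method names (ParameterServer, push_gradient, aggregate, pull) are hypothetical and are not the AutoPS or Parameter Service API; the sketch only shows the bursty averaging-and-update step that the paper proposes to decouple from individual training jobs and offload to a shared service.

```python
# Hypothetical sketch of PS-style model aggregation (not the AutoPS API).
import numpy as np

class ParameterServer:
    def __init__(self, param_shape, lr=0.1):
        self.params = np.zeros(param_shape)  # shared model parameters
        self.lr = lr
        self.buffer = []                     # gradients received this round

    def push_gradient(self, grad):
        """A worker pushes its locally computed gradient."""
        self.buffer.append(grad)

    def aggregate(self):
        """Average buffered gradients and update the parameters.
        This bursty CPU-bound step is what a shared aggregation
        service could pack and migrate across available CPUs."""
        if not self.buffer:
            return
        mean_grad = np.mean(self.buffer, axis=0)
        self.params -= self.lr * mean_grad
        self.buffer.clear()

    def pull(self):
        """Workers pull the latest parameters before the next iteration."""
        return self.params.copy()

# Toy usage: two workers contribute gradients in one aggregation round.
ps = ParameterServer(param_shape=(4,))
for worker_grad in (np.ones(4), 3 * np.ones(4)):
    ps.push_gradient(worker_grad)
ps.aggregate()
print(ps.pull())  # [-0.2 -0.2 -0.2 -0.2]
```

In a conventional PS deployment, each training job would run its own instance of this loop on statically allocated CPUs; the abstract's point is that a shared, elastic service can serve this step for all jobs in the cluster instead.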