Paper title
Straggler-resistant distributed matrix computation via coding theory
Authors
Abstract
The current big data era routinely requires processing large-scale data on massive distributed computing clusters. Such clusters often suffer from "stragglers", defined as slow or failed nodes. Without a sophisticated assignment of tasks to the worker nodes, the overall speed of a computational job on these clusters is typically dominated by the stragglers. In recent years, approaches based on coding theory (referred to as "coded computation") have been used effectively for straggler mitigation. Coded computation offers significant benefits for specific classes of problems such as distributed matrix computations, which play a crucial role in several parts of the machine learning pipeline. The essential idea is to create redundant tasks so that the desired result can be recovered as long as a certain number of worker nodes complete their tasks. In this survey article, we give an overview of recent developments in coding for straggler-resilient distributed matrix computations.
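The idea of redundant tasks can be illustrated with a minimal sketch. It is a toy example and not a scheme taken from the article: a matrix-vector product A x is spread over three workers so that any two finished workers suffice. The helper names (`matvec`, `encode`, `decode`), the matrix, and the vector are all hypothetical, and the sketch assumes A has an even number of rows.

```python
from itertools import combinations


def matvec(M, x):
    """Plain matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def encode(A):
    """Split A into row blocks A1, A2 and add one redundant block A1 + A2.

    Assumes A has an even number of rows (an assumption of this sketch).
    """
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [A1, A2, parity]  # one task per worker: workers 0, 1, 2


def decode(results):
    """Recover A x from any two worker results, given as {worker_id: vector}."""
    if len(results) < 2:
        raise ValueError("need results from at least 2 of the 3 workers")
    if 0 in results and 1 in results:
        top, bottom = results[0], results[1]
    elif 0 in results:
        # Worker 1 straggled: A2 x = (A1 + A2) x - A1 x
        top = results[0]
        bottom = [p - t for p, t in zip(results[2], top)]
    else:
        # Worker 0 straggled: A1 x = (A1 + A2) x - A2 x
        bottom = results[1]
        top = [p - b for p, b in zip(results[2], bottom)]
    return top + bottom


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
tasks = encode(A)
expected = matvec(A, x)

# Try every pair of finished workers; the third plays the straggler.
recovered = {}
for finished in combinations(range(3), 2):
    results = {w: matvec(tasks[w], x) for w in finished}
    recovered[finished] = decode(results)
```

Each worker handles half of the rows of A. The price of tolerating one straggler is a third task of the same size. Practical schemes generalize this to recovery from any k of n workers, typically using more elaborate codes than the single parity block assumed here.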