论文标题
在线性代数内核的平行I/O最优性上:近乎最佳的LU分解
On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization
论文作者
论文摘要
密集的线性代数内核,例如线性求解器或张量收缩,是许多科学计算应用的基本组成部分。在这项工作中,我们提出了一种新的方法,可以为这个广泛的计划提供平行I/O的下限。基于X分区抽象,我们的方法明确捕获了统计依赖性。将我们的分析应用于lu分解,我们得出了Conflux,这是一种LU算法,其平行I/O成本为$ n^3/(p \ sqrt {m})$每个处理器传达的元素 - 在我们已建立的下限上,只有$ 1/3 \ times $。我们在各种问题大小上评估Conflux,证明了与我们的理论分析相匹配的经验结果,比Cray Scalapack或Slate的渐近交流少,并且表现优于渐近优美的CANDMC库。 Conflux以$ 1 $,$ 024 $ $ 024 $的节点传达了1.6 $ \ tims $ $ \ times $少于第二好的实施,预计在Summit上进行全尺寸运行中的2.1 $ \ tims $少$。
Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorization, we derive COnfLUX, an LU algorithm with the parallel I/O cost of $N^3 / (P \sqrt{M})$ communicated elements per processor -- only $1/3\times$ over our established lower bound. We evaluate COnfLUX on various problem sizes, demonstrating empirical results that match our theoretical analysis, communicating asymptotically less than Cray ScaLAPACK or SLATE, and outperforming the asymptotically-optimal CANDMC library. Running on $1$,$024$ nodes of Piz Daint, COnfLUX communicates 1.6$\times$ less than the second-best implementation and is expected to communicate 2.1$\times$ less on a full-scale run on Summit.