论文标题

具有任意界限的投影嵌套环的通信 - 最佳砖块

Communication-Optimal Tilings for Projective Nested Loops with Arbitrary Bounds

论文作者

Dinh, Grace, Demmel, James

论文摘要

减少通信 - 在内存层次结构的级别或网络上的处理器之间,是许多问题的性能优化(在时间和能量中)的关键组成部分,包括密集的线性代数,粒子交互和机器学习。对于可以表示为嵌套环计算的这些问题,已经使用了以前的基于平铺的方法来找到执行它们所需的通信和最佳重排或最佳的封锁或阻止以达到这样的下限的下限。但是,这种一般的方法通常假定问题大小很大,这在实践中通常不可满足。例如,经典$(\ text {#arithmetic operations})/(\ text {cache size})^{1/2} $对于矩阵乘法而言,对于矩阵 - 矢量乘法而言,矩阵乘法的下限并不紧密,必须在至少$ o中读取$ o(\ text {\ text {#arithmetic操作}#arithmetimetimetimetimetimetimetimetimet of morme of moreme of morembor of morembor;机器学习应用程序中几乎所有卷积的问题都发生了类似的问题,这些卷积使用极小的滤镜大小(因此,循环范围)。 在本文中,我们提供了一种有效的方法,可以通过适当的,有效构造的阻塞,通信下限和匹配砖块来查找和获取,这些嵌套的嵌套循环程序具有这些下限,该嵌套循环程序具有任意循环界限,这些循环界限在投射情况下在多维阵列上运行,在射程组中,阵列Indices是Loop Indices的子集。我们的方法在所有此类问题上都起作用,无论维度,大小,内存访问模式或阵列数量如何,并直接应用于(除其他示例)矩阵乘法和类似的密集线性代数操作,张量收缩,N-Body成对相互作用,点式卷积,点旋转卷积和完全连接的层。

Reducing communication - either between levels of a memory hierarchy or between processors over a network - is a key component of performance optimization (in both time and energy) for many problems, including dense linear algebra, particle interactions, and machine learning. For these problems, which can be represented as nested-loop computations, previous tiling based approaches have been used to find both lower bounds on the communication required to execute them and optimal rearrangements, or blockings, to attain such lower bounds. However, such general approaches have typically assumed the problem sizes are large, an assumption that is often not met in practice. For instance, the classical $(\text{# arithmetic operations})/(\text{cache size})^{1/2}$ lower bound for matrix multiplication is not tight for matrix-vector multiplications, which must read in at least $O(\text{# arithmetic operations})$ words of memory; similar issues occur for almost all convolutions in machine learning applications, which use extremely small filter sizes (and therefore, loop bounds). In this paper, we provide an efficient way to both find and obtain, via an appropriate, efficiently constructible blocking, communication lower bounds and matching tilings which attain these lower bounds for nested loop programs with arbitrary loop bounds that operate on multidimensional arrays in the projective case, where the array indices are subsets of the loop indices. Our approach works on all such problems, regardless of dimensionality, size, memory access patterns, or number of arrays, and directly applies to (among other examples) matrix multiplication and similar dense linear algebra operations, tensor contractions, n-body pairwise interactions, pointwise convolutions, and fully connected layers.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源