Paper Title

FLOPs as a Discriminant for Dense Linear Algebra Algorithms

Authors

Francisco López, Lars Karlsson, Paolo Bientinesi

Abstract

Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations to highly optimised kernels provided in libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in terms of execution time. When faced with a decision, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea choose by selecting the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms, analysing "anomalies": problem instances for which the fastest algorithm does not perform the least number of FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that FLOPs is not a sufficiently dependable discriminant even when building algorithms with highly optimised kernels. Plus, most of the anomalies remained as such even after filtering out the inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.
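To make the role of associativity concrete, the following sketch (not taken from the paper; the matrix size, the use of NumPy, and the textbook FLOP formulas 2n³ for a dense matrix-matrix product and 2n² for a matrix-vector product are illustrative assumptions) contrasts two mathematically equivalent evaluation orders of A·B·v, comparing their FLOP counts and measured execution times:

```python
# Illustrative sketch: two mathematically equivalent parenthesisations of
# A @ B @ v differ vastly in FLOP count and, typically, in execution time.
# FLOP formulas below are the standard counts for dense GEMM (2*n^3 for
# n x n matrices) and GEMV (2*n^2).
import time
import numpy as np

n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
v = rng.standard_normal(n)

# Algorithm 1: (A @ B) @ v  -> one matrix-matrix product + one matrix-vector product
flops_1 = 2 * n**3 + 2 * n**2
t0 = time.perf_counter()
r1 = (A @ B) @ v
t1 = time.perf_counter()

# Algorithm 2: A @ (B @ v)  -> two matrix-vector products
flops_2 = 2 * n**2 + 2 * n**2
t2 = time.perf_counter()
r2 = A @ (B @ v)
t3 = time.perf_counter()

print(f"(A B) v : {flops_1:.3e} FLOPs, {t1 - t0:.4f} s")
print(f"A (B v) : {flops_2:.3e} FLOPs, {t3 - t2:.4f} s")
print("max abs difference:", np.max(np.abs(r1 - r2)))
```

In this simple case the parenthesisation with the lower FLOP count is almost always also the faster one; the "anomalies" studied in the paper are precisely the problem instances where that intuition breaks down.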
