Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization

Paper Title

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization

Authors

Xingfu Wu, Michael Kruse, Prasanna Balaprakash, Hal Finkel, Paul Hovland, Valerie Taylor, Mary Hall

Abstract

Autotuning is an approach that explores the search space of possible implementations or configurations of a kernel or an application, either by selecting and evaluating a subset of them on a target platform, by using models, or both, in order to identify a high-performance implementation or configuration. In this paper, we develop an autotuning framework that leverages Bayesian optimization to explore the parameter search space. We select six of the most complex benchmarks from the application domains of the PolyBench benchmarks (syr2k, 3mm, heat-3d, lu, covariance, and Floyd-Warshall) and apply the newly developed LLVM Clang/Polly loop optimization pragmas to optimize them. We then use the autotuning framework to tune the pragma parameters to improve performance. The experimental results show that our autotuning approach outperforms the other compiling methods, providing the smallest execution time for the benchmarks syr2k, 3mm, heat-3d, lu, and covariance with two large datasets, within 200 code evaluations, while effectively searching parameter spaces with up to 170,368 different configurations. We compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We find that the Floyd-Warshall benchmark did not benefit from autotuning because the heuristics Polly uses to optimize the benchmark make it run much slower. To cope with this issue, we provide some compiler option solutions to improve the performance.
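
The search loop the abstract describes (evaluate a few configurations, fit a model to the measured times, and let the model choose the next configuration) can be illustrated with a minimal sketch. This is not the authors' framework. The parameter names, the search space, and the `synthetic_runtime` function are hypothetical stand-ins for compiling and timing a benchmark. A simple nearest-neighbor surrogate with a distance-based exploration bonus is swapped in for the supervised learning methods the paper compares.

```python
import itertools
import math
import random

# Hypothetical search space (illustrative only, not taken from the paper):
# tile sizes for three loops plus a loop-interchange flag.
SPACE = {
    "tile_i": [4, 8, 16, 32, 64, 128],
    "tile_j": [4, 8, 16, 32, 64, 128],
    "tile_k": [4, 8, 16, 32, 64, 128],
    "interchange": [0, 1],
}


def all_configs(space):
    """Enumerate every configuration in the discrete search space."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*space.values())]


def synthetic_runtime(cfg):
    """Stand-in for compiling and timing a benchmark (assumption, not real data)."""
    penalty = (abs(math.log2(cfg["tile_i"]) - 5)
               + abs(math.log2(cfg["tile_j"]) - 6)
               + abs(math.log2(cfg["tile_k"]) - 4))
    return 1.0 + 0.1 * penalty + 0.3 * (1 - cfg["interchange"])


def features(cfg):
    """Map a configuration to a numeric vector for the surrogate."""
    return (math.log2(cfg["tile_i"]), math.log2(cfg["tile_j"]),
            math.log2(cfg["tile_k"]), float(cfg["interchange"]))


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(cfg, history, k=3):
    """Return (mean time of the k nearest evaluated configs, distance to the nearest)."""
    x = features(cfg)
    near = sorted((distance(x, features(c)), t) for c, t in history)[:k]
    mean = sum(t for _, t in near) / len(near)
    return mean, near[0][0]


def autotune(space, evaluate, budget=40, n_init=8, kappa=0.05, seed=0):
    """Surrogate-guided search: random start, then model-chosen evaluations."""
    rng = random.Random(seed)
    candidates = all_configs(space)
    rng.shuffle(candidates)
    history = [(c, evaluate(c)) for c in candidates[:n_init]]
    pool = candidates[n_init:]

    def score(cfg):
        # Lower is better: predicted time minus a bonus for unexplored regions.
        mean, novelty = predict(cfg, history)
        return mean - kappa * novelty

    while len(history) < budget and pool:
        chosen = min(pool, key=score)
        pool.remove(chosen)
        history.append((chosen, evaluate(chosen)))
    return history


history = autotune(SPACE, synthetic_runtime)
best_cfg, best_time = min(history, key=lambda item: item[1])
print(f"evaluated {len(history)} of {len(all_configs(SPACE))} configurations")
print("best configuration found:", best_cfg, "time:", best_time)
```

The point of the design is the evaluation budget: only a small fraction of the space is ever measured, and the surrogate decides where each measurement is spent. In a real setting, `evaluate` would compile the benchmark with the chosen pragma parameters and return the measured execution time.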

Paper Title