Paper Title
Analyzing Search Techniques for Autotuning Image-based GPU Kernels: The Impact of Sample Sizes
Paper Authors
Paper Abstract
Modern computing systems are increasingly complex, with their multicore CPUs and GPU accelerators changing yearly, if not more often. It has thus become very challenging to write programs that efficiently use the associated complex memory systems and take advantage of the available parallelism. Autotuning addresses this problem by optimizing parameterized code for the targeted hardware, searching for the optimal set of parameters. Empirical autotuning has therefore gained interest over the past decades. While new autotuning algorithms are regularly presented and published, we will show why comparing these autotuning algorithms is a deceptively difficult task. In this paper, we describe our empirical study of state-of-the-art search techniques for autotuning, comparing them across a range of sample sizes, benchmarks, and architectures. We optimize 6 tunable parameters with a search space of over 2 million configurations. The algorithms studied include Random Search (RS), Random Forest Regression (RF), Genetic Algorithms (GA), Bayesian Optimization with Gaussian Processes (BO GP), and Bayesian Optimization with Tree-Parzen Estimators (BO TPE). Our results on the ImageCL benchmark suite suggest that the ideal autotuning algorithm depends heavily on the sample size. In our study, BO GP and BO TPE outperform the other algorithms in most scenarios with sample sizes from 25 to 100. However, GA usually outperforms the others for sample sizes of 200 and beyond. We generally see the largest speedup over RS in the lower range of sample sizes (25-100), while the algorithms outperform RS more consistently at higher sample sizes (200-400). Hence, no single state-of-the-art algorithm outperforms the rest for all sample sizes. Some suggestions for future work are also included.
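To make the evaluation setup concrete, below is a minimal Python sketch of the Random Search (RS) baseline run at the study's sample sizes. This is not the authors' code: the parameter names, value lists, and the synthetic benchmark_kernel cost model are illustrative assumptions, and the toy space shown here is far smaller than the paper's real 6-parameter space of over 2 million configurations timed on actual GPU kernels.

    # Hypothetical sketch of a Random Search (RS) autotuning baseline,
    # evaluated at the sample sizes used in the study (25-400).
    import random

    # Toy 6-parameter tuning space (assumed names and value ranges).
    PARAM_SPACE = {
        "block_x":   [1, 2, 4, 8, 16, 32, 64, 128],
        "block_y":   [1, 2, 4, 8, 16, 32, 64, 128],
        "tile_x":    [1, 2, 4, 8],
        "tile_y":    [1, 2, 4, 8],
        "unroll":    [1, 2, 4, 8, 16],
        "use_local": [0, 1],
    }

    def benchmark_kernel(config, rng):
        """Synthetic stand-in for compiling and timing one configuration.

        A real harness would launch the ImageCL kernel on the GPU and
        return its measured runtime instead of this toy cost surface.
        """
        cost = abs(config["block_x"] - 32) + abs(config["block_y"] - 8)
        cost += abs(config["unroll"] - 4) + (0 if config["use_local"] else 5)
        return cost + rng.random()  # noise mimics measurement jitter

    def random_search(sample_size, rng):
        """Draw sample_size random configurations; return the best cost found."""
        best = float("inf")
        for _ in range(sample_size):
            config = {name: rng.choice(values) for name, values in PARAM_SPACE.items()}
            best = min(best, benchmark_kernel(config, rng))
        return best

    if __name__ == "__main__":
        rng = random.Random(42)
        for n in (25, 50, 100, 200, 400):  # the study's sample sizes
            print(f"sample size {n:3d}: best cost {random_search(n, rng):.2f}")

Each of the smarter strategies compared in the paper (RF, GA, BO GP, BO TPE) would replace the uniform sampling inside random_search with model-guided or population-based selection while keeping the same per-configuration measurement budget, which is what makes a comparison at a fixed sample size fair.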