Paper Title

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Authors

Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

Abstract

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization. To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid. Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0x over applications that use dynamic parallelism, 8.7x over applications that do not use dynamic parallelism, and 3.6x over applications that use dynamic parallelism with aggregation alone as proposed in prior work.
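The thresholding optimization described above can be sketched in CUDA. This is a minimal illustrative example, not the paper's actual compiler output: `THRESHOLD`, `childKernel`, and the `childCounts`/`childOffsets` data layout are all assumptions made for the sketch. The key idea it shows is that a parent thread launches a child grid device-side only when the amount of child work justifies the launch overhead, and otherwise serializes the child work in place.

```cuda
// Hypothetical sketch of thresholding for dynamic parallelism.
// Compile with relocatable device code enabled (e.g., nvcc -rdc=true).
#define THRESHOLD 128  // illustrative cutoff, not a value from the paper

__global__ void childKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;  // placeholder child work
    }
}

__global__ void parentKernel(float* data, int* childCounts, int* childOffsets) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int n = childCounts[p];               // child threads owed by this parent
    float* childData = data + childOffsets[p];

    if (n > THRESHOLD) {
        // Enough child work: launch a child grid dynamically from the device.
        int blocks = (n + 255) / 256;
        childKernel<<<blocks, 256>>>(childData, n);
    } else {
        // Too little work: serialize the children in the parent thread,
        // avoiding the launch latency and underutilization of a tiny grid.
        for (int i = 0; i < n; ++i) {
            childData[i] *= 2.0f;
        }
    }
}
```

The `if/else` split is the essence of the transformation: small, irregular child workloads stay inline, while only launches large enough to fill the hardware pay the device-side launch cost.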
