Paper Title


GPU Tensor Cores for fast Arithmetic Reductions

Paper Authors

Navarro, Cristóbal A., Carrasco, Roberto, Barrientos, Ricardo J., Riquelme, Javier A., Vega, Raimundo

Paper Abstract


This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2}{n}$ and its speedup is $S = \dfrac{4}{5}\log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and preserves numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4, 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine-Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
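To make the encoding concrete, below is a minimal CUDA sketch (not the authors' implementation) of the core idea using the `nvcuda::wmma` API: an all-ones fragment multiplied against data tiles turns each MMA into a summation, a chain of `num_tiles` MMAs accumulates into a single fp32 accumulator fragment (mirroring the abstract's claim that sub-results stay in 32-bit floats), and a short scalar loop finishes the reduction. Names such as `reduce_chain_wmma`, `TILE`, and `num_tiles` are illustrative assumptions.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

#define TILE 16  // WMMA m = n = k = 16 for half inputs with float accumulation

// One warp reduces num_tiles consecutive 16x16 tiles of half-precision data.
// Each MMA computes acc += ONES(16x16) * DATA_t(16x16), so every row of acc
// ends up holding the running per-column sums across all tiles (kept in fp32).
__global__ void reduce_chain_wmma(const half *data, float *out, int num_tiles) {
    __shared__ half ones[TILE * TILE];
    __shared__ float partial[TILE * TILE];

    // Build the all-ones operand cooperatively.
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x)
        ones[i] = __float2half(1.0f);
    __syncthreads();

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, ones, TILE);  // A = all-ones matrix

    // Chain of MMAs: the fp32 accumulator carries the sub-results of the chain.
    for (int t = 0; t < num_tiles; ++t) {
        wmma::load_matrix_sync(b_frag, data + t * TILE * TILE, TILE);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C[i][j] += sum_k B_t[k][j]
    }

    wmma::store_matrix_sync(partial, c_frag, TILE, wmma::mem_row_major);
    __syncthreads();

    // Row 0 of the accumulator holds the 16 column totals; add them in fp32.
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int j = 0; j < TILE; ++j) total += partial[j];
        *out = total;
    }
}
```

Launched as `reduce_chain_wmma<<<1, 32>>>(d_data, d_out, R);`, a single warp reduces $R \times 256$ half values; it requires a tensor-core GPU (compute capability 7.0 or newer, e.g. `nvcc -arch=sm_70`). A full multi-block version in the spirit of the paper would run one such chain per thread-block and then reduce the per-block partials recursively.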
