Paper Title


GPU Tensor Cores for fast Arithmetic Reductions

Paper Authors

Navarro, Cristóbal A., Carrasco, Roberto, Barrientos, Ricardo J., Riquelme, Javier A., Vega, Raimundo

Paper Abstract


This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2}{n}$ and its speedup is $S = \dfrac{4}{5}\log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and preserves numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4, 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine-Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
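To make the encoding concrete, below is a minimal CUDA sketch (not the authors' implementation) of the core idea using the `nvcuda::wmma` API: an all-ones fragment multiplied against data tiles turns each MMA into a summation, a chain of `num_tiles` MMAs accumulates into a single fp32 accumulator fragment (mirroring the abstract's claim that sub-results stay in 32-bit floats), and a short scalar loop finishes the reduction. Names such as `reduce_chain_wmma`, `TILE`, and `num_tiles` are illustrative assumptions.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

#define TILE 16  // WMMA m = n = k = 16 for half inputs with float accumulation

// One warp reduces num_tiles consecutive 16x16 tiles of half-precision data.
// Each MMA computes acc += ONES(16x16) * DATA_t(16x16), so every row of acc
// ends up holding the running per-column sums across all tiles (kept in fp32).
__global__ void reduce_chain_wmma(const half *data, float *out, int num_tiles) {
    __shared__ half ones[TILE * TILE];
    __shared__ float partial[TILE * TILE];

    // Build the all-ones operand cooperatively.
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x)
        ones[i] = __float2half(1.0f);
    __syncthreads();

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, ones, TILE);  // A = all-ones matrix

    // Chain of MMAs: the fp32 accumulator carries the sub-results of the chain.
    for (int t = 0; t < num_tiles; ++t) {
        wmma::load_matrix_sync(b_frag, data + t * TILE * TILE, TILE);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C[i][j] += sum_k B_t[k][j]
    }

    wmma::store_matrix_sync(partial, c_frag, TILE, wmma::mem_row_major);
    __syncthreads();

    // Row 0 of the accumulator holds the 16 column totals; add them in fp32.
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int j = 0; j < TILE; ++j) total += partial[j];
        *out = total;
    }
}
```

Launched as `reduce_chain_wmma<<<1, 32>>>(d_data, d_out, R);`, a single warp reduces $R \times 256$ half values; it requires a tensor-core GPU (compute capability 7.0 or newer, e.g. `nvcc -arch=sm_70`). A full multi-block version in the spirit of the paper would run one such chain per thread-block and then reduce the per-block partials recursively.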
