Paper Title

Fast convolution kernels on Pascal GPU with high memory efficiency

Paper Authors

Qiong Chang, Masaki Onishi, Tsutomu Maruyama

Paper Abstract


The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of training data in CNNs, GPUs have been used for acceleration, and memory-efficient algorithms have attracted attention because of their high performance. In this paper, we propose two convolution kernels, for single-channel convolution and multi-channel convolution respectively. Both methods achieve high performance by efficiently hiding the access latency of the global memory and by achieving a high ratio of floating-point fused multiply-add (FMA) operations per data element fetched from the global memory. Compared with the latest cuDNN library developed by NVIDIA to accelerate deep-learning computation, the average performance improvement of our kernels is 2.6x for single-channel convolution and 1.4x for multi-channel convolution.
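To make the general idea behind "many FMAs per fetched element" concrete, below is a minimal CUDA sketch of a shared-memory tiled single-channel 2D convolution. It is not the authors' kernel: the tile size `TILE`, filter radius `RADIUS`, and all names are illustrative assumptions. It only shows the standard pattern the abstract alludes to, where each block stages an input tile plus halo into shared memory once, so every global-memory fetch is reused by many fused multiply-adds.

```cuda
// Illustrative sketch (NOT the paper's kernel): tiled single-channel 2D
// convolution. TILE and RADIUS are assumed example parameters.
#define TILE   16
#define RADIUS 2                       // filter is (2*RADIUS+1) x (2*RADIUS+1)
#define KSIZE  (2 * RADIUS + 1)

__constant__ float d_filter[KSIZE * KSIZE];   // filter kept in constant memory

__global__ void conv2d_tiled(const float* __restrict__ in,
                             float* __restrict__ out,
                             int width, int height)
{
    // Input tile plus halo, staged once per block in shared memory.
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int out_x = (int)blockIdx.x * TILE + threadIdx.x;
    int out_y = (int)blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the tile and its halo, clamped at image borders.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE) {
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int gx = (int)blockIdx.x * TILE + dx - RADIUS;
            int gy = (int)blockIdx.y * TILE + dy - RADIUS;
            gx = min(max(gx, 0), width  - 1);
            gy = min(max(gy, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    }
    __syncthreads();

    if (out_x < width && out_y < height) {
        float acc = 0.0f;
        // Each fetched element is reused by up to KSIZE*KSIZE FMAs.
        for (int ky = 0; ky < KSIZE; ++ky)
            for (int kx = 0; kx < KSIZE; ++kx)
                acc = fmaf(tile[threadIdx.y + ky][threadIdx.x + kx],
                           d_filter[ky * KSIZE + kx], acc);
        out[out_y * width + out_x] = acc;
    }
}
```

A launch would use `dim3 block(TILE, TILE)` and a grid covering the image; the paper's actual kernels differ in how they schedule fetches to hide global-memory latency, which this sketch does not attempt to reproduce.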
