Paper Title
Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs
Paper Authors
Paper Abstract
GPUs are currently the platform of choice for training neural networks. However, training a deep neural network (DNN) is a time-consuming process even on GPUs because of the massive number of parameters that have to be learned. As a result, accelerating DNN training has been an area of significant research in the last couple of years. While earlier networks such as AlexNet had a linear dependency between layers and operations, state-of-the-art networks such as ResNet, PathNet, and GoogLeNet have a non-linear structure that exhibits a higher level of inter-operation parallelism. However, popular deep learning (DL) frameworks such as TensorFlow and PyTorch launch the majority of neural network operations, especially convolutions, serially on GPUs and do not exploit this inter-op parallelism. In this brief announcement, we make a case for the need and potential benefit of exploiting this rich parallelism in state-of-the-art non-linear networks to reduce training time. We identify the challenges and limitations of enabling concurrent layer execution on GPU backends (such as cuDNN) of DL frameworks and propose potential solutions.
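
To illustrate the kind of inter-op parallelism the abstract refers to, the following is a minimal PyTorch sketch (not from the paper; the branch shapes and stream usage are illustrative assumptions). It launches two independent convolution branches, as in an Inception-style block, on separate CUDA streams rather than in the framework's default serial launch order:

    import torch
    import torch.nn as nn

    # Illustrative sketch: two independent convolution branches placed on
    # separate CUDA streams so the GPU scheduler may overlap their kernels.
    device = torch.device("cuda")
    x = torch.randn(8, 64, 56, 56, device=device)

    branch_a = nn.Conv2d(64, 96, kernel_size=1).to(device)
    branch_b = nn.Conv2d(64, 32, kernel_size=3, padding=1).to(device)

    stream_a = torch.cuda.Stream()
    stream_b = torch.cuda.Stream()

    # Ensure the input tensor is ready before both side streams read it.
    torch.cuda.synchronize()

    with torch.cuda.stream(stream_a):
        out_a = branch_a(x)
    with torch.cuda.stream(stream_b):
        out_b = branch_b(x)

    # Wait for both branches to finish before consuming their outputs.
    torch.cuda.synchronize()
    out = torch.cat([out_a, out_b], dim=1)

Note that issuing kernels on separate streams does not guarantee actual overlap on the device: whether the two convolutions run concurrently depends on the cuDNN kernels selected and on available GPU resources, and a single convolution kernel can already saturate the GPU's compute units. These are among the limits on concurrent layer execution that the announcement examines.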