Paper Title
GPU Accelerated Automatic Differentiation With Clad
Paper Authors
Paper Abstract
Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. AD application domains range from Machine Learning to Robotics to High Energy Physics. Computing gradients with AD is guaranteed to be more precise than numerical differentiation and requires only a low, constant factor more arithmetic operations than the original function. Moreover, AD applications to domain problems are typically compute-bound: they are often limited by the computational demands of high-dimensional parameter spaces and thus can benefit from parallel implementations on graphics processing units (GPUs). Clad is a compiler-assisted AD tool that aims to enable differentiation of C/C++ and CUDA code; it is available both as a compiler extension and in ROOT. Clad works as a plugin extending the Clang compiler, as a plugin extending the interactive interpreter Cling, and as a Jupyter kernel extension based on xeus-cling. We demonstrate the advantages of parallel gradient computations on GPUs with Clad. We explain how extending Clad to support CUDA brings forth a new layer of optimization and a proportional speedup. The gradients of well-behaved C++ functions can be executed automatically on a GPU. The library can be easily integrated into existing frameworks or used interactively. Furthermore, we demonstrate the achieved application performance improvements, including a ~10x speedup in ROOT histogram fitting, and the corresponding performance gains from offloading to GPUs.
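To make the interface concrete, below is a minimal sketch of differentiating a C++ function with Clad's clad::gradient primitive. The example function rosenbrock and the compile command in the leading comment are illustrative assumptions, not taken from the paper; only the clad::gradient / execute pattern follows Clad's documented usage.

```cpp
// Minimal sketch (assumption: Clad is installed as a Clang plugin and this
// file is compiled roughly as
//   clang++ -fplugin=clad.so -I<clad>/include example.cpp
// -- exact paths and flags depend on the local installation).
#include "clad/Differentiator/Differentiator.h"
#include <cstdio>

// A simple, "well-behaved" function of two parameters (hypothetical example).
double rosenbrock(double x, double y) {
  return (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x);
}

int main() {
  // Clad generates the gradient code at compile time, as a transformation
  // inside the compiler rather than by operator overloading at run time.
  auto grad = clad::gradient(rosenbrock);

  // Adjoint outputs, one per differentiated parameter.
  double dx = 0, dy = 0;
  grad.execute(1.5, 2.0, &dx, &dy);
  std::printf("d/dx = %f, d/dy = %f\n", dx, dy);
  return 0;
}
```

With the CUDA support described in the abstract, the same generated gradient of a well-behaved function can be offloaded to the GPU; the exact launch mechanism is not shown in this sketch.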