Seamless GPU acceleration for C++ based physics with the Metal Shading Language on Apple's M series unified chips

Paper title

Seamless GPU acceleration for C++ based physics with the Metal Shading Language on Apple's M series unified chips

Authors

Gebraad, Lars; Fichtner, Andreas

Abstract

Apple's M series chips have proven to be a capable and power-efficient alternative to mainstream Intel and AMD x86 processors for everyday tasks. Additionally, the unified design, which integrates the central processing unit and the graphics processing unit, has allowed these M series chips to excel at many tasks with heavy graphical requirements without the need for a discrete graphics processing unit (GPU), in some cases even outperforming discrete GPUs. In this work, we show how the M series chips can be leveraged using the Metal Shading Language (MSL) to accelerate typical array operations in C++. More importantly, we show how the use of MSL avoids the typical complexity of CUDA or OpenACC memory management by allowing the central processing unit (CPU) and GPU to work in unified memory. We demonstrate how the M series chips perform on standard one-dimensional and two-dimensional array operations such as array addition, SAXPY and finite-difference stencils, relative to serial and OpenMP-accelerated CPU code. The reduced complexity of implementing MSL also allows us to accelerate an existing elastic wave equation solver (originally based on OpenMP-accelerated C++) using MSL with minimal effort, while retaining all CPU and OpenMP functionality. The resulting performance gain when simulating the wave equation approaches an order of magnitude for specific settings. This gain is similar to that of other GPU-accelerated wave-propagation codes relative to their CPU variants, but it does not come at the much higher programming complexity that prevents the typical scientific programmer from leveraging these accelerators. This result shows how unified processing units can be a valuable tool for seismologists and computational scientists in general, lowering the bar to writing performant codes that leverage modern GPUs.
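The abstract benchmarks MSL against serial and OpenMP-accelerated C++ on array addition, SAXPY and one- and two-dimensional finite-difference stencils. As a minimal sketch (not the authors' code; every function name here is hypothetical), CPU reference kernels of that kind might look as follows. The `#pragma omp parallel for` lines are an assumption about how an OpenMP variant would be parallelised: compiled with `-fopenmp` they split the loop across threads, and compiled without it they are ignored, so the same code doubles as the serial baseline. The MSL GPU kernels themselves are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise array addition: c[i] = a[i] + b[i].
void array_add(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c) {
    assert(a.size() == c.size() && b.size() == c.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(c.size());
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// SAXPY (single precision): y[i] = alpha * x[i] + y[i], updated in place.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(y.size());
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}

// Second-order central difference for d2u/dx2 on interior points only;
// the two boundary entries of `out` are left untouched.
void stencil_1d(const std::vector<float>& u, std::vector<float>& out, float dx) {
    assert(out.size() == u.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(u.size());
    const float inv_dx2 = 1.0f / (dx * dx);
#pragma omp parallel for
    for (std::ptrdiff_t i = 1; i < n - 1; ++i) {
        out[i] = (u[i - 1] - 2.0f * u[i] + u[i + 1]) * inv_dx2;
    }
}

// Five-point Laplacian on an nx-by-ny grid stored row-major (index = i * ny + j),
// with uniform spacing h; boundary rows and columns of `out` are left untouched.
void stencil_2d(const std::vector<float>& u, std::vector<float>& out,
                std::ptrdiff_t nx, std::ptrdiff_t ny, float h) {
    assert(u.size() == static_cast<std::size_t>(nx * ny) && out.size() == u.size());
    const float inv_h2 = 1.0f / (h * h);
#pragma omp parallel for
    for (std::ptrdiff_t i = 1; i < nx - 1; ++i) {
        for (std::ptrdiff_t j = 1; j < ny - 1; ++j) {
            const std::ptrdiff_t k = i * ny + j;
            out[k] = (u[k - ny] + u[k + ny] + u[k - 1] + u[k + 1] - 4.0f * u[k]) * inv_h2;
        }
    }
}
```

Each output element depends only on input values, so every loop iteration is independent; that is what makes these loops easy to parallelise with OpenMP and to map onto a GPU with one thread per index. With unified memory, the CPU and GPU can work on the same shared buffer, so no explicit host-to-device copies are needed, which is the contrast the abstract draws with CUDA and OpenACC memory management.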