Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Paper Title

Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Authors

Alappat, Christie L., Hofmann, Johannes, Hager, Georg, Fehske, Holger, Bishop, Alan R., Wellein, Gerhard

Abstract

Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are relevant for performance are not even known to most application users or developers. Microbenchmarks, i.e., simple codes that fathom a particular aspect of the hardware, can help to shed light on such issues, but only if they are well understood and if the results can be reconciled with known facts or performance models. The insight gained from microbenchmarks may then be applied to real applications for performance analysis or optimization. In this paper we investigate two modern Intel x86 server CPU architectures in depth: Broadwell EP and Cascade Lake SP. We highlight relevant hardware configuration settings that can have a decisive impact on code performance and show how to properly measure on-chip and off-chip data transfer bandwidths. The new victim L3 cache of Cascade Lake and its advanced replacement policy receive due attention. Finally we use DGEMM, sparse matrix-vector multiplication, and the HPCG benchmark to make a connection to relevant application scenarios.
