Paper Title
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads
Authors
Abstract
Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.