Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications

Paper title

Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications

Authors

Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, Margaret Martonosi

Abstract

Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work with prefetching, decoupling, or pipelining can mitigate memory latency and improve core utilization, memory bottlenecks persist due to limited off-chip bandwidth. Approaches doing processing in-memory (PIM) with Hybrid Memory Cube (HMC) overcome bandwidth limitations but fail to achieve high core utilization due to poor task scheduling and synchronization overheads. Moreover, the high memory-per-core ratio available with HMC limits strong scaling. We introduce Dalorex, a hardware-software co-design that achieves high parallelism and energy efficiency, demonstrating strong scaling with >16,000 cores when processing graph and sparse linear algebra workloads. Over the prior work in PIM, both using 256 cores, Dalorex improves performance and energy consumption by two orders of magnitude through (1) a tile-based distributed-memory architecture where each processing tile holds an equal amount of data, and all memory operations are local; (2) a task-based parallel programming model where tasks are executed by the processing unit that is co-located with the target data; (3) a network design optimized for irregular traffic, where all communication is one-way, and messages do not contain routing metadata; (4) novel traffic-aware task scheduling hardware that maintains high core utilization; and (5) a data placement strategy that improves work balance. This work proposes architectural and software innovations to provide the greatest scalability to date for running graph algorithms while still being programmable for other domains.
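Points (1) and (2) of the abstract, equal data per tile and tasks executed where the target data lives, can be illustrated with a small software simulation. The following is only a minimal sketch under stated assumptions, not the authors' hardware or programming interface: the names `Tile`, `owner`, and `run_bfs` are hypothetical, the tiles are visited sequentially rather than in parallel, and the network and the traffic-aware scheduler are reduced to plain per-tile queues.

```python
from collections import deque


class Tile:
    """Hypothetical processing tile: owns a contiguous chunk of vertices."""

    def __init__(self, first_vertex, adjacency):
        self.first_vertex = first_vertex      # global id of the first local vertex
        self.adjacency = adjacency            # local slice of the adjacency lists
        self.dist = [None] * len(adjacency)   # local slice of the distance array
        self.queue = deque()                  # incoming tasks: (vertex, distance)


def owner(vertex, chunk):
    """Tile that holds `vertex`, derived from the index alone (assumed layout)."""
    return vertex // chunk


def run_bfs(adjacency, num_tiles, source):
    """Breadth-first distances computed with data-local tasks (sequential simulation)."""
    n = len(adjacency)
    chunk = -(-n // num_tiles)  # ceiling division: equal-sized chunks per tile
    tiles = [
        Tile(t * chunk, adjacency[t * chunk:(t + 1) * chunk])
        for t in range(num_tiles)
    ]
    tiles[owner(source, chunk)].queue.append((source, 0))

    pending = True
    while pending:
        pending = False
        for tile in tiles:
            while tile.queue:
                pending = True
                vertex, d = tile.queue.popleft()
                local = vertex - tile.first_vertex
                # Every read and write below touches only this tile's data.
                if tile.dist[local] is not None and tile.dist[local] <= d:
                    continue
                tile.dist[local] = d
                # One-way messages: a task is sent to the neighbor's owner
                # and nothing is returned to the sender.
                for nbr in tile.adjacency[local]:
                    tiles[owner(nbr, chunk)].queue.append((nbr, d + 1))

    return [d for tile in tiles for d in tile.dist]


graph = [[1, 2], [3], [3], [4], [], [0]]
print(run_bfs(graph, num_tiles=3, source=0))
```

In this sketch the destination tile is computed from the vertex index, so a task carries only its payload, and the result does not depend on how many tiles the graph is split across.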
