GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation

Paper Title

GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation

Authors

Ondrej Sykora, Phitchaya Mangpo Phothilimthana, Charith Mendis, Amir Yazdanbakhsh

Abstract

Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state-of-the-art for basic block performance estimation with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduced the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feed forward decoder networks. Our results show that this technique further improves precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, concluding a set of methods to achieve high accuracy for basic block performance estimation.
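
To make the graph representation mentioned in the abstract more concrete, the following is a minimal Python sketch of how a basic block might be turned into a graph with structural edges (between consecutive instructions) and data-dependency edges (through register values). It is an illustration under stated assumptions, not the encoding used by GRANITE: the `(mnemonic, destinations, sources)` instruction format, the `BlockGraph` class, the `build_block_graph` function, and the edge type names are all hypothetical, and memory operands, immediates, and flags are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class BlockGraph:
    """Hypothetical container: nodes are (kind, label), edges are (src, dst, type)."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, kind, label):
        self.nodes.append((kind, label))
        return len(self.nodes) - 1


def build_block_graph(instructions):
    """Build a toy dependency graph from (mnemonic, dests, srcs) tuples.

    Assumption: every register write creates a fresh value node, so a later
    read is linked to the most recent write of that register.
    """
    graph = BlockGraph()
    latest_value = {}  # register name -> id of the node holding its latest value
    previous = None
    for mnemonic, dests, srcs in instructions:
        inst = graph.add_node("instruction", mnemonic)
        # Structural edge: program order between consecutive instructions.
        if previous is not None:
            graph.edges.append((previous, inst, "structural"))
        # Data edges: each source register feeds the instruction.
        for reg in srcs:
            if reg not in latest_value:
                latest_value[reg] = graph.add_node("value", reg)
            graph.edges.append((latest_value[reg], inst, "input"))
        # Data edges: the instruction produces a new value for each destination.
        for reg in dests:
            latest_value[reg] = graph.add_node("value", reg)
            graph.edges.append((inst, latest_value[reg], "output"))
        previous = inst
    return graph


# A three-instruction toy block: mov rax, rbx; add rax, rcx; imul rdx, rax
block = [
    ("mov", ["rax"], ["rbx"]),
    ("add", ["rax"], ["rax", "rcx"]),
    ("imul", ["rdx"], ["rax"]),
]
graph = build_block_graph(block)
```

In a sketch like this, the `add` instruction reads the `rax` value written by `mov`, so the dependency chain is visible as a path through value nodes. A graph neural network would then pass messages along such edges before a decoder predicts the block's throughput.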
