Paper Title

Deep Learning based Data Prefetching in CPU-GPU Unified Virtual Memory

Paper Authors

Xinjian Long, Xiangyang Gong, Huiyang Zhou

Abstract

Unified Virtual Memory (UVM) relieves developers of the onus of maintaining complex data structures and performing explicit data migration by enabling on-demand data movement between CPU memory and GPU memory. However, on-demand paging soon becomes a performance bottleneck for UVM due to the high latency caused by page table walks and data migration over the interconnect. Prefetching is considered a promising solution to this problem, given its ability to leverage the locality of program memory access patterns. However, existing locality-based prefetching schemes cannot handle all situations. An ideal prefetcher should not only look at narrow regions of the requested address space but also capture global context to deliver a good prediction of the memory access pattern. This paper proposes a novel approach to page prefetching for UVM through deep learning. We first show that a powerful Transformer learning model can provide high accuracy for UVM page prefetching. We then perform analysis to interpret this Transformer model and derive several insights that allow us to design a simpler model that matches the unconstrained model's accuracy at orders-of-magnitude lower cost. We evaluate this simplified model on a set of 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art UVM framework, improving performance by 10.89%, improving the device memory page hit rate by 16.98% (89.02% vs. 76.10% for prior art), and reducing CPU-GPU interconnect traffic by 11.05%. According to our proposed unified metric, which combines accuracy, coverage, and page hit rate, our solution comes closer to the ideal prefetching scheme than the state-of-the-art design (0.90 vs. 0.85, where a perfect prefetcher scores 1.0).
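To make the contrast concrete, the locality-based schemes the abstract mentions can be caricatured as history-table predictors over page deltas: they predict well on regular strides but have no global context. The sketch below is purely illustrative (the `DeltaPrefetcher` class and its API are invented for this note; the paper's actual predictor is a Transformer-derived model, not this table):

```python
from collections import Counter, defaultdict

class DeltaPrefetcher:
    """Toy locality-based prefetcher: records which page delta tends to
    follow the previous delta, and predicts the most frequent successor.
    Illustrative stand-in only -- not the paper's learned model."""

    def __init__(self):
        # last observed delta -> histogram of the deltas that followed it
        self.table = defaultdict(Counter)
        self.last_page = None
        self.last_delta = None

    def access(self, page):
        """Record one page access; return a page number to prefetch, or None."""
        prediction = None
        if self.last_page is not None:
            delta = page - self.last_page
            if self.last_delta is not None:
                self.table[self.last_delta][delta] += 1
            self.last_delta = delta
            if self.table[delta]:
                predicted_delta = self.table[delta].most_common(1)[0][0]
                prediction = page + predicted_delta
        self.last_page = page
        return prediction
```

On a strided stream (pages 0, 2, 4, ...) this predictor quickly locks onto the +2 delta and prefetches the next page correctly; on irregular, context-dependent patterns its single-delta history fails, which is the gap the paper's sequence model is meant to close.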
