Title
SIMD Lossy Compression for Scientific Data
Authors
Abstract
Modern HPC applications produce increasingly large amounts of data, which limits the performance of current extreme-scale systems. Data reduction techniques, such as lossy compression, help to mitigate this issue by decreasing the size of data generated by these applications. SZ, a current state-of-the-art lossy compressor, is able to achieve high compression ratios, but the prediction/quantization methods it uses introduce dependencies that prevent parallelizing this step of the compression. Recent work proposes a parallel dual prediction/quantization algorithm for GPUs which removes these dependencies. However, some HPC systems and applications do not use GPUs, and could still benefit from the fine-grained parallelism of this method. Using the dual-quantization technique, we implement and optimize a SIMD-vectorized CPU version of SZ, and create a heuristic for selecting the optimal block size and vector length. We also investigate the effect of non-zero block padding values in decreasing the number of unpredictable values along compression block borders. We measure the performance of vecSZ against pSZ, an O3-optimized CPU version of SZ using dual-quantization, as well as against SZ-1.4. We evaluate our vectorized version, vecSZ, on the Intel Skylake and AMD Rome architectures using real-world scientific datasets. We find that applying alternative padding reduces the number of outliers by 100\% for some configurations. Our implementation also results in up to 32\% improvement in rate-distortion and up to 15$\times$ speedup over SZ-1.4, achieving a prediction and quantization bandwidth in excess of 3.4 GB/s.
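The property that makes SIMD vectorization possible is the dual-quantization step described in the abstract: each value is first quantized independently, and prediction is then performed on the already-quantized grid, so no loop iteration depends on the reconstruction of the previous one. Below is a minimal sketch of this idea in C, assuming a 1D layout, a previous-neighbor (Lorenzo-style) predictor, and an absolute error bound eb; the function names (prequantize, postquantize) are illustrative and are not the authors' actual vecSZ API, and outlier handling and multi-dimensional blocking are omitted.

/* Minimal sketch of the dual-quantization idea (assumptions: 1D data,
 * previous-neighbor Lorenzo predictor, absolute error bound eb;
 * names are illustrative, not the vecSZ implementation). */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Step 1: prequantize every value independently. There is no
 * loop-carried dependency, so this loop is trivially vectorizable. */
static void prequantize(const float *data, int32_t *pre, size_t n, float eb)
{
    const float ebx2_r = 1.0f / (2.0f * eb);
    for (size_t i = 0; i < n; ++i)
        pre[i] = (int32_t)roundf(data[i] * ebx2_r);
}

/* Step 2: postquantize. Each code depends only on already-prequantized
 * neighbors, never on a value reconstructed inside this loop, so this
 * loop is also free of loop-carried dependencies and vectorizable. */
static void postquantize(const int32_t *pre, int32_t *quant, size_t n)
{
    quant[0] = pre[0];                  /* first value has no predecessor */
    for (size_t i = 1; i < n; ++i)
        quant[i] = pre[i] - pre[i - 1]; /* Lorenzo prediction on the quantized grid */
}

int main(void)
{
    float data[8] = {1.0f, 1.1f, 1.2f, 1.5f, 2.0f, 2.1f, 2.0f, 1.9f};
    int32_t pre[8], quant[8];
    prequantize(data, pre, 8, 0.05f);
    postquantize(pre, quant, 8);
    for (int i = 0; i < 8; ++i)
        printf("%d ", quant[i]);
    printf("\n");
    return 0;
}

In classic SZ, by contrast, the predictor reads previously reconstructed values, so the quantization code of element i cannot be computed before element i-1 has been quantized; removing that dependency is what allows both loops above to be auto-vectorized or written with explicit SIMD intrinsics on CPUs.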