Title
SIMD Lossy Compression for Scientific Data
Authors
Abstract
Modern HPC applications produce increasingly large amounts of data, which limits the performance of current extreme-scale systems. Data reduction techniques, such as lossy compression, help to mitigate this issue by decreasing the size of data generated by these applications. SZ, a current state-of-the-art lossy compressor, is able to achieve high compression ratios, but the prediction/quantization methods it uses introduce dependencies that prevent parallelizing this step of the compression. Recent work proposes a parallel dual prediction/quantization algorithm for GPUs which removes these dependencies. However, some HPC systems and applications do not use GPUs, and could still benefit from the fine-grained parallelism of this method. Using the dual-quantization technique, we implement and optimize a SIMD-vectorized CPU version of SZ, and create a heuristic for selecting the optimal block size and vector length. We also investigate the effect of non-zero block padding values in decreasing the number of unpredictable values along compression block borders. We measure the performance of vecSZ against pSZ, an O3-optimized CPU version of SZ using dual-quantization, as well as against SZ-1.4. We evaluate our vectorized version, vecSZ, on the Intel Skylake and AMD Rome architectures using real-world scientific datasets. We find that applying alternative padding reduces the number of outliers by 100\% for some configurations. Our implementation also results in up to 32\% improvement in rate-distortion and up to 15$\times$ speedup over SZ-1.4, achieving a prediction and quantization bandwidth in excess of 3.4 GB/s.
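The property that makes SIMD vectorization possible is the dual-quantization step described in the abstract: each value is first quantized independently, and prediction is then performed on the already-quantized grid, so no loop iteration depends on the reconstruction of the previous one. Below is a minimal sketch of this idea in C, assuming a 1D layout, a previous-neighbor (Lorenzo-style) predictor, and an absolute error bound eb; the function names (prequantize, postquantize) are illustrative and are not the authors' actual vecSZ API, and outlier handling and multi-dimensional blocking are omitted.

/* Minimal sketch of the dual-quantization idea (assumptions: 1D data,
 * previous-neighbor Lorenzo predictor, absolute error bound eb;
 * names are illustrative, not the vecSZ implementation). */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Step 1: prequantize every value independently. There is no
 * loop-carried dependency, so this loop is trivially vectorizable. */
static void prequantize(const float *data, int32_t *pre, size_t n, float eb)
{
    const float ebx2_r = 1.0f / (2.0f * eb);
    for (size_t i = 0; i < n; ++i)
        pre[i] = (int32_t)roundf(data[i] * ebx2_r);
}

/* Step 2: postquantize. Each code depends only on already-prequantized
 * neighbors, never on a value reconstructed inside this loop, so this
 * loop is also free of loop-carried dependencies and vectorizable. */
static void postquantize(const int32_t *pre, int32_t *quant, size_t n)
{
    quant[0] = pre[0];                  /* first value has no predecessor */
    for (size_t i = 1; i < n; ++i)
        quant[i] = pre[i] - pre[i - 1]; /* Lorenzo prediction on the quantized grid */
}

int main(void)
{
    float data[8] = {1.0f, 1.1f, 1.2f, 1.5f, 2.0f, 2.1f, 2.0f, 1.9f};
    int32_t pre[8], quant[8];
    prequantize(data, pre, 8, 0.05f);
    postquantize(pre, quant, 8);
    for (int i = 0; i < 8; ++i)
        printf("%d ", quant[i]);
    printf("\n");
    return 0;
}

In classic SZ, by contrast, the predictor reads previously reconstructed values, so the quantization code of element i cannot be computed before element i-1 has been quantized; removing that dependency is what allows both loops above to be auto-vectorized or written with explicit SIMD intrinsics on CPUs.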