Paper Title
Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs
Paper Authors
Paper Abstract
More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly, while its decompression still suffers considerably lower performance because of its sophisticated lossless decompression step -- customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving its overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose deep architectural optimizations for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory in the decoding/writing phases, tuning the amount of shared memory online, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an NVIDIA V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup of 3.64X over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43X on average.
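To make the bottleneck concrete, the following is a minimal serial sketch of prefix-code (Huffman) decoding, the operation that the GPU decoders studied in the paper parallelize across thousands of threads. The codebook here is a small hypothetical example for illustration only, not cuSZ's actual codebook, and the bit-by-bit string matching stands in for the real bit-level buffer manipulation.

```python
# Hypothetical prefix-free codebook (shorter codes for more frequent symbols).
CODEBOOK = {"0": "A", "10": "B", "110": "C", "111": "D"}

def huffman_encode(symbols: str) -> str:
    """Concatenate the codeword of each symbol into one bitstring."""
    inverse = {sym: code for code, sym in CODEBOOK.items()}
    return "".join(inverse[s] for s in symbols)

def huffman_decode(bits: str) -> str:
    """Decode by accumulating bits until they match a complete codeword.

    Because the code is prefix-free, the first match is always correct;
    this inherently sequential scan is what makes GPU parallelization hard.
    """
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in CODEBOOK:      # a complete codeword was read
            out.append(CODEBOOK[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

encoded = huffman_encode("ABACD")
print(encoded)                   # 0100110111
print(huffman_decode(encoded))   # ABACD
```

The serial data dependence shown here (each codeword's start depends on where the previous one ended) is why the GPU decoders need techniques such as splitting the bitstream into chunks and staging intermediate results in shared memory, as the abstract describes.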