论文标题
Ones -Weep:GPU的最低数字radix排序更快的速度
Onesweep: A Faster Least Significant Digit Radix Sort for GPUs
论文作者
论文摘要
我们提出了wayeep,这是一个最不重要的数字(LSD)radix排序算法,该算法用于全球内存中的大型GPU排序问题。我们的并行算法采用了一种单频前缀总和的方法,该方法仅需要〜2n全局读/写操作,每个数字式迭代。这表现出最后级别的内存流量与当代GPU radix排序实现的显着降低,其中每次迭代的数字binning都需要通过数据集进行两次通过〜3N全局内存操作。 在NVIDIA A100 GPU上,我们的方法在排序256m随机32位键时达到29.4 GKEY/S。与CUB相比,当前最新的GPU LSD radix排序,我们的方法提供了约1.5倍的速度。对于具有不同分布的32位键,与HRS,最新的GPU MSD Radix排序相比,我们的方法提供了更一致的性能,并且在几乎所有情况下都表现出色。
We present Onesweep, a least-significant digit (LSD) radix sorting algorithm for large GPU sorting problems residing in global memory. Our parallel algorithm employs a method of single-pass prefix sum that only requires ~2n global read/write operations for each digit-binning iteration. This exhibits a significant reduction in last-level memory traffic versus contemporary GPU radix sorting implementations, where each iteration of digit binning requires two passes through the dataset totaling ~3n global memory operations. On the NVIDIA A100 GPU, our approach achieves 29.4 GKey/s when sorting 256M random 32-bit keys. Compared to CUB, the current state-of-the-art GPU LSD radix sort, our approach provides a speedup of ~1.5x. For 32-bit keys with varied distributions, our approach provides more consistent performance compared to HRS, the current state-of-the-art GPU MSD radix sort, and outperforms it in almost all cases.