Paper Title

IAAT: A Input-Aware Adaptive Tuning framework for Small GEMM

Authors

Jianyu Yao, Boqian Shi, Chunyang Xiang, Haipeng Jia, Chendi Li, Hang Cao, Yunquan Zhang

Abstract

GEMM on small input matrices is becoming widely used in fields such as HPC and machine learning. Although many well-known BLAS libraries already support small GEMM, they cannot achieve near-optimal performance, because pack operations are costly and the overhead of frequent boundary processing cannot be neglected. This paper proposes an input-aware adaptive tuning framework (IAAT) for small GEMM to overcome these performance bottlenecks in state-of-the-art implementations. IAAT consists of two stages: the install-time stage and the run-time stage. In the install-time stage, IAAT auto-generates hundreds of kernels of different sizes to remove pack operations. In the run-time stage, IAAT tiles matrices into blocks to alleviate boundary processing; this stage uses an input-aware adaptive tile algorithm and plays the role of run-time tuning. Finally, IAAT completes the small GEMM computation by invoking the kernels that correspond to the block sizes. The experimental results show that IAAT achieves better performance than other BLAS libraries on the ARMv8 platform.
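The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `make_kernel`, `KERNELS`, `split`, and `small_gemm`, the 4 x 4 maximum tile size, and the greedy splitting rule are all assumptions made for illustration, and the paper's input-aware adaptive tile algorithm is likely more sophisticated than this. The sketch only shows the general shape: a table of fixed-size kernels built ahead of time, and a run-time step that tiles the input so every block, including edge blocks, maps to an exact-size kernel that reads A and B without packing.

```python
from itertools import product

# Hypothetical largest kernel tile; a real limit would depend on the register file.
MAX_MR, MAX_NR = 4, 4


def make_kernel(mr, nr):
    """Build a kernel that updates one mr x nr block of C, reading A and B unpacked."""
    def kernel(A, B, C, i0, j0, K):
        for i in range(i0, i0 + mr):
            for j in range(j0, j0 + nr):
                acc = 0.0
                for k in range(K):
                    acc += A[i][k] * B[k][j]
                C[i][j] += acc
    return kernel


# "Install-time" stage (sketch): one kernel per supported block shape.
KERNELS = {
    (mr, nr): make_kernel(mr, nr)
    for mr, nr in product(range(1, MAX_MR + 1), range(1, MAX_NR + 1))
}


def split(extent, max_tile):
    """Greedily split one dimension into tile sizes, each at most max_tile."""
    sizes = []
    while extent > 0:
        tile = min(extent, max_tile)
        sizes.append(tile)
        extent -= tile
    return sizes


def small_gemm(A, B):
    """'Run-time' stage (sketch): tile C and dispatch each block to its exact-size kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    i0 = 0
    for mr in split(M, MAX_MR):
        j0 = 0
        for nr in split(N, MAX_NR):
            # Edge blocks get their own kernel, so no boundary checks are needed inside it.
            KERNELS[(mr, nr)](A, B, C, i0, j0, K)
            j0 += nr
        i0 += mr
    return C
```

For example, a 5 x 6 result matrix is covered by blocks of shapes 4 x 4, 4 x 2, 1 x 4, and 1 x 2, each handled by a dedicated kernel rather than by a general kernel with boundary branches.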
