Paper Title
TLP: A Deep Learning-based Cost Model for Tensor Program Tuning
Paper Authors
Paper Abstract
Tensor program tuning is a non-convex objective optimization problem, to which search-based approaches have proven to be effective. At the core of the search-based approaches lies the design of the cost model. Though deep learning-based cost models perform significantly better than other methods, they still suffer from the following problems. First, their feature extraction heavily relies on expert-level domain knowledge of hardware architectures. Even so, the extracted features are often unsatisfactory and require separate considerations for CPUs and GPUs. Second, a cost model trained on one hardware platform usually performs poorly on another, a problem we call cross-hardware unavailability. To address these problems, we propose TLP and MTL-TLP. TLP is a deep learning-based cost model that facilitates tensor program tuning. Instead of extracting features from the tensor program itself, TLP extracts features from the schedule primitives. We treat schedule primitives as a tensor language; TLP is thus a Tensor Language Processing task. In this way, the task of predicting tensor program latency through the cost model is transformed into a natural language processing (NLP) regression task. MTL-TLP combines Multi-Task Learning with TLP to cope with the cross-hardware unavailability problem. We incorporate these techniques into the Ansor framework and conduct detailed experiments. Results show that TLP can speed up the average search time by 9.1X and 3.0X on CPU and GPU workloads, respectively, compared to the state-of-the-art implementation. MTL-TLP can achieve speed-ups of 4.7X and 2.9X on CPU and GPU workloads, respectively, using only 7% of the target hardware data.
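To make the abstract's core idea concrete, below is a minimal sketch in PyTorch of how schedule primitives could be tokenized and fed to an NLP-style sequence model for latency regression, together with a multi-task variant that shares a backbone across hardware platforms as MTL-TLP describes. This is an illustration of the idea, not the authors' released implementation; all class names, hyperparameters, and the tokenization scheme (`ScheduleEncoder`, `vocab_size`, `n_hardware`, etc.) are assumptions.

```python
# Hedged sketch of a TLP-style cost model: treat schedule primitives as
# language tokens and regress latency, NLP-style. Hypothetical names and
# hyperparameters throughout; not the paper's actual architecture.
import torch
import torch.nn as nn

class ScheduleEncoder(nn.Module):
    """Shared backbone: embed schedule-primitive tokens, encode the
    sequence with a small Transformer, and mean-pool to one vector."""
    def __init__(self, vocab_size=512, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                # tokens: (batch, seq_len) int ids
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        return h.mean(dim=1)                  # (batch, d_model)

class TLPRegressor(nn.Module):
    """TLP-style model: a single regression head predicts a latency score
    from the tokenized schedule primitives alone (no hand-crafted,
    hardware-specific features)."""
    def __init__(self, d_model=128, **kw):
        super().__init__()
        self.backbone = ScheduleEncoder(d_model=d_model, **kw)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens):
        return self.head(self.backbone(tokens)).squeeze(-1)

class MTLTLPRegressor(nn.Module):
    """MTL-TLP-style variant: one shared backbone plus one head per
    hardware platform, so plentiful source-hardware data trains the
    backbone while scarce target-hardware data mainly fits its own head."""
    def __init__(self, n_hardware=3, d_model=128, **kw):
        super().__init__()
        self.backbone = ScheduleEncoder(d_model=d_model, **kw)
        self.heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(n_hardware))

    def forward(self, tokens, hw_id):
        return self.heads[hw_id](self.backbone(tokens)).squeeze(-1)

# Toy usage: score a batch of two tokenized schedules for hardware 0.
model = MTLTLPRegressor()
tokens = torch.randint(0, 512, (2, 64))
print(model(tokens, hw_id=0).shape)           # torch.Size([2])
```

In this reading, the cross-hardware setup is ordinary hard parameter sharing: only the small per-hardware head must be fit from target-hardware measurements, which is consistent with the reported 7% target-data requirement.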