Paper Title
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
Paper Authors
Paper Abstract
Transformer-based models are the state of the art for Natural Language Understanding (NLU) applications. Models are getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning, and numerical optimization can lead to drastic improvements in inference efficiency. We provide effective recipes that can guide practitioners in choosing the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve speed-ups of 9.8x to 233.9x compared to out-of-the-box models on CPU. On GPU, we also achieve up to 12.4x speed-up with the presented methods. We show that FastFormers can drastically reduce the cost of serving 100 million requests, from 4,223 USD to just 18 USD, on an Azure F16s_v2 instance. This translates to a sustainable runtime, reducing energy consumption by 6.9x to 125.8x according to the metrics used in the SustaiNLP 2020 shared task.
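
The abstract names three concrete techniques: knowledge distillation, structured pruning, and numerical optimization. Below is a minimal sketch of the latter two steps using the Hugging Face transformers and PyTorch APIs; the model checkpoint, pruned head indices, and example input are illustrative assumptions, not the paper's exact configuration.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed stand-in for a distilled student model; the paper distills
# task-specific students before pruning and quantizing.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Structured pruning: remove whole attention heads per layer.
# The head indices here are arbitrary placeholders, not the paper's choices.
model.prune_heads({0: [0, 1], 1: [2]})

# Numerical optimization: 8-bit dynamic quantization of all Linear layers,
# one of the CPU-side optimizations the paper's recipes rely on.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("FastFormers makes inference cheap.", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(logits)

Dynamic quantization stores Linear weights as int8 and computes activation scales on the fly at inference time, so it needs no calibration data, which makes it a convenient first optimization to try on CPU.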