Paper Title


Balancing Cost and Benefit with Tied-Multi Transformers

Authors

Raj Dabre, Raphael Rubino, Atsushi Fujita

Abstract


We propose and evaluate a novel procedure for training multiple Transformers with tied parameters which compresses multiple models into one, enabling the dynamic choice of the number of encoder and decoder layers during decoding. In sequence-to-sequence modeling, typically, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute loss. Instead, our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers. Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers. We then propose a mechanism to choose a priori the number of encoder and decoder layers for faster decoding, and also explore recurrent stacking of layers and knowledge distillation for model compression. We present a cost-benefit analysis of applying the proposed approaches for neural machine translation and show that they reduce decoding costs while preserving translation quality.
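To make the NxM loss concrete, below is a minimal sketch in PyTorch under stated assumptions: a toy encoder-decoder whose intermediate layer outputs are all exposed, with one cross-entropy loss per (encoder depth n, decoder depth m) pair averaged into a single training loss. The class name, dimensions, and the omission of target shifting and causal masking are illustrative simplifications, not the authors' implementation.

import torch
import torch.nn as nn

class TiedMultiTransformer(nn.Module):
    """Sketch: N-layer encoder and M-layer decoder whose intermediate
    outputs are all used, giving one loss per (n, m) combination.
    Because the same layer modules serve every combination, the NxM
    sub-models share (tie) their parameters."""

    def __init__(self, vocab, d_model=64, n_enc=3, n_dec=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.enc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_enc))
        self.dec_layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_dec))
        self.proj = nn.Linear(d_model, vocab)  # shared output projection

    def forward(self, src, tgt):
        # NOTE: target shifting and causal masking are omitted for brevity.
        losses = []
        enc = self.embed(src)
        for enc_layer in self.enc_layers:        # encoder depth n = 1..N
            enc = enc_layer(enc)
            dec = self.embed(tgt)
            for dec_layer in self.dec_layers:    # decoder depth m = 1..M
                dec = dec_layer(dec, enc)
                logits = self.proj(dec)
                losses.append(nn.functional.cross_entropy(
                    logits.transpose(1, 2), tgt))
        # single training loss aggregating all NxM losses
        return torch.stack(losses).mean()

# usage sketch with random token ids
model = TiedMultiTransformer(vocab=100)
src = torch.randint(0, 100, (2, 7))  # (batch, source length)
tgt = torch.randint(0, 100, (2, 5))  # (batch, target length)
loss = model(src, tgt)
loss.backward()

At decoding time, such a model can be truncated to any (n, m) pair, which is what enables the cost-benefit trade-off between speed and translation quality described in the abstract.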
