Paper Title
Streaming Simultaneous Speech Translation with Augmented Memory Transformer
Paper Authors
Paper Abstract
Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios, since self-attention is computed over the entire input sequence and its computational cost grows quadratically with the sequence length. Moreover, most previous work on simultaneous speech translation, the task of generating translations from partial audio input, ignores the time spent generating the translation when analyzing latency. Under this assumption, a system may exhibit a good latency-quality trade-off yet be inapplicable in real-time scenarios. In this paper, we focus on the task of streaming simultaneous speech translation, where the system must not only translate from partial input but also handle very long or continuous input. We propose an end-to-end transformer-based sequence-to-sequence model equipped with an augmented memory transformer encoder, an architecture that has shown great success on streaming automatic speech recognition with hybrid or transducer-based models. We conduct an empirical evaluation of the proposed model across segment, context, and memory sizes, and we compare our approach to a transformer with a unidirectional mask.