Paper Title
Multiresolution and Multimodal Speech Recognition with Transformers
Paper Authors
Paper Abstract
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We particularly focus on the scene context provided by the visual information to ground the ASR. We extract representations for audio features in the encoder layers of the Transformer and fuse video features using an additional cross-modal multi-head attention layer. Additionally, we incorporate a multitask training criterion for multiresolution ASR, where we train the model to generate both character- and subword-level transcriptions. Experimental results on the How2 dataset indicate that multiresolution training can speed up convergence by around 50% and improve word error rate (WER) by up to 18% relative over subword prediction models. Further, incorporating visual information improves performance, with relative gains of up to 3.76% over audio-only models. Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
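To make the two ideas in the abstract concrete, below is a minimal sketch in PyTorch of (a) fusing video features into audio representations with an extra cross-modal multi-head attention layer and (b) a multitask criterion combining character- and subword-level losses. All names (CrossModalFusionLayer, multiresolution_loss), dimensions, and the loss weight alpha are illustrative assumptions, not the authors' implementation; the abstract does not specify the exact fusion placement or loss weighting.

```python
# Hedged sketch of the abstract's two components, assuming PyTorch.
import torch
import torch.nn as nn


class CrossModalFusionLayer(nn.Module):
    """Audio queries attend over video keys/values via multi-head attention,
    then the attended context is residual-added to the audio stream."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, d_model); video: (batch, T_video, d_model)
        fused, _ = self.cross_attn(query=audio, key=video, value=video)
        return self.norm(audio + fused)  # residual connection + layer norm


def multiresolution_loss(char_logits, char_targets,
                         subword_logits, subword_targets, alpha: float = 0.5):
    """Multitask criterion: weighted sum of character- and subword-level
    cross-entropy losses. The 0.5/0.5 weighting and padding index 0 are
    assumptions for illustration."""
    ce = nn.CrossEntropyLoss(ignore_index=0)
    # CrossEntropyLoss expects (batch, classes, time), so transpose logits.
    char_loss = ce(char_logits.transpose(1, 2), char_targets)
    subword_loss = ce(subword_logits.transpose(1, 2), subword_targets)
    return alpha * char_loss + (1 - alpha) * subword_loss


# Shape check with random tensors (batch=2, T_audio=50, T_video=20, d_model=256).
fusion = CrossModalFusionLayer()
out = fusion(torch.randn(2, 50, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```

In a full model of this kind, the fusion layer would sit alongside the Transformer encoder's audio layers, and separate character-level and subword-level decoders would share that encoder, so the multitask criterion trains one set of encoder representations against both transcription resolutions.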