Paper Title
ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks
Paper Authors
Abstract
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation. For the Tunisian Arabic-English dataset (low-resource and dialect tracks), we build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for automatic speech recognition (ASR). Our results show that, in our settings, pipeline approaches are still very competitive, and that with the use of transfer learning they can outperform end-to-end models for speech translation (ST). For the Tamasheq-French dataset (low-resource track), our primary submission leverages intermediate representations from a wav2vec 2.0 model trained on 234 hours of Tamasheq audio, while our contrastive model uses a French phonetic transcription of the Tamasheq audio as input to a Conformer speech translation architecture jointly trained on ASR, ST, and machine translation (MT) losses. Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models. Results also illustrate that even approximate phonetic transcriptions can improve ST scores.
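Since the primary Tamasheq-French submission feeds intermediate wav2vec 2.0 representations into the ST model, the sketch below shows how such representations can be extracted with the HuggingFace transformers API. The checkpoint name and layer index are placeholders for illustration only; the paper's actual model is a wav2vec 2.0 trained on 234 hours of Tamasheq audio, which this sketch does not reproduce.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Stand-in checkpoint so the sketch runs as-is; the paper uses its own
# Tamasheq-trained wav2vec 2.0 model, not this public English one.
CKPT = "facebook/wav2vec2-base"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT, output_hidden_states=True)
model.eval()

# One second of dummy 16 kHz audio standing in for a Tamasheq utterance.
waveform = torch.randn(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000,
                           return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, frames, dim). Layer 8 is an arbitrary illustrative choice for
# the intermediate representation passed to a downstream ST decoder.
intermediate = outputs.hidden_states[8]
print(intermediate.shape)  # e.g. torch.Size([1, 49, 768])
```

In a cascaded or end-to-end setup, these frame-level features would replace filterbank inputs to the translation model; which layer works best is an empirical question the placeholder index does not answer.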
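The contrastive Conformer system is trained jointly on ASR, ST, and MT losses. Below is a minimal PyTorch sketch of one common way to realize such a weighted multi-task objective: CTC for the (approximate) phonetic transcription and cross-entropy for the two text decoders. The function signature, tensor shapes, and 0.3/0.5/0.2 weights are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def joint_loss(asr_log_probs, asr_targets, input_lengths, target_lengths,
               st_logits, st_targets, mt_logits, mt_targets,
               w_asr=0.3, w_st=0.5, w_mt=0.2):
    # CTC loss over the phonetic transcription: asr_log_probs has shape
    # (time, batch, vocab) and must already be log-softmaxed.
    l_asr = F.ctc_loss(asr_log_probs, asr_targets,
                       input_lengths, target_lengths)
    # Cross-entropy over French target tokens for the ST and MT decoders;
    # logits are (batch, seq, vocab), so move vocab to dim 1.
    l_st = F.cross_entropy(st_logits.transpose(1, 2), st_targets)
    l_mt = F.cross_entropy(mt_logits.transpose(1, 2), mt_targets)
    return w_asr * l_asr + w_st * l_st + w_mt * l_mt

# Dummy shapes: 50 encoder frames, batch of 2, ASR vocab 32,
# text vocab 100, 10 target tokens per sequence.
T, N, V_asr, V_txt, L = 50, 2, 32, 100, 10
loss = joint_loss(
    torch.randn(T, N, V_asr).log_softmax(-1),   # ASR log-probs
    torch.randint(1, V_asr, (N, L)),            # ASR targets (0 = blank)
    torch.full((N,), T, dtype=torch.long),      # encoder frame lengths
    torch.full((N,), L, dtype=torch.long),      # target lengths
    torch.randn(N, L, V_txt), torch.randint(0, V_txt, (N, L)),  # ST
    torch.randn(N, L, V_txt), torch.randint(0, V_txt, (N, L)),  # MT
)
print(loss.item())
```

The auxiliary ASR and MT terms act as regularizers on the shared encoder; in practice the weights are tuned on a development set.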