Paper Title

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Paper Authors

Tsiamas, Ioannis, Gállego, Gerard I., Fonollosa, José A. R., Costa-jussà, Marta R.

Paper Abstract

Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the included frames in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translation of the segments produced by our method approaches the quality of the manual segmentation on 5 language pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods. Our method is additionally generalizable to different domains and achieves high zero-shot performance in unseen languages.
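The probabilistic Divide-and-Conquer step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `probs[i]` stands for the classifier's probability that frame `i` lies inside a speech segment, and `max_frames` is a hypothetical length limit; the real SHAS algorithm operates on wav2vec 2.0 frame-level outputs and includes additional details not shown here.

```python
def split_segment(probs, start, end, max_frames):
    """Recursively split the half-open range [start, end) at the interior
    frame with the lowest inside-segment probability, until every
    resulting segment is at most max_frames long.

    probs      -- list of per-frame probabilities of being inside a segment
    max_frames -- pre-specified maximum segment length (in frames)
    Returns a list of (start, end) index pairs.
    """
    if end - start <= max_frames:
        return [(start, end)]
    # Choose the interior frame least likely to belong inside a segment
    # as the splitting point (avoiding the two boundary frames).
    cut = min(range(start + 1, end - 1), key=lambda i: probs[i])
    return (split_segment(probs, start, cut, max_frames)
            + split_segment(probs, cut, end, max_frames))
```

For example, with a probability dip at frame 5 and a 6-frame limit, the range `[0, 10)` is split exactly at the dip, yielding `[(0, 5), (5, 10)]`; a segment already under the limit is returned unchanged.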
