Paper Title


Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data

Paper Authors

Mozhdeh Gheini, Tatiana Likhomanenko, Matthias Sperber, Hendra Setiawan

Paper Abstract


Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds that to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an absence of sufficient data resources. We show that under such data-deficient circumstances, the unlabeled data can significantly vary in domain from the supervised data, which results in pseudo-label quality degradation. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that pseudo-label analysis and processing as such results in additional gains on top of the vanilla pseudo-labeling setup resulting in total improvements of up to 0.6% absolute WER and 2.2 BLEU points.
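The pseudo-labeling loop the abstract describes (label unsupervised data, filter out low-quality pseudo-labels to handle domain mismatch, add the rest to the training pool) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model, scoring function, and threshold are all stand-in assumptions.

```python
# Hedged sketch of vanilla pseudo-labeling with confidence-based filtering.
# `model`, `score_fn`, and `threshold` are hypothetical placeholders, not
# the components used in the paper.

def pseudo_label(model, unlabeled, score_fn, threshold):
    """Label unsupervised examples; keep only confident pseudo-labels."""
    pool = []
    for x in unlabeled:
        y = model(x)  # hypothesized label (e.g. transcript/translation)
        # Filtering step: drop pseudo-labels whose score falls below the
        # threshold, a simple remedy for out-of-distribution inputs.
        if score_fn(x, y) >= threshold:
            pool.append((x, y))
    return pool

# Toy usage with stand-in functions (purely illustrative):
toy_model = lambda x: x.upper()
toy_score = lambda x, y: 1.0 if len(x) > 3 else 0.0
pool = pseudo_label(toy_model, ["hello", "hi"], toy_score, 0.5)
# Only the confidently scored example survives filtering.
```

In practice the score might come from model likelihood or an agreement heuristic; the filtered pool is then mixed with the supervised data for further training.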
