Paper Title


Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data

Paper Authors

Mozhdeh Gheini, Tatiana Likhomanenko, Matthias Sperber, Hendra Setiawan

Paper Abstract


Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds that to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an absence of sufficient data resources. We show that under such data-deficient circumstances, the unlabeled data can significantly vary in domain from the supervised data, which results in pseudo-label quality degradation. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that pseudo-label analysis and processing as such results in additional gains on top of the vanilla pseudo-labeling setup resulting in total improvements of up to 0.6% absolute WER and 2.2 BLEU points.
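The pseudo-labeling loop the abstract describes (label unsupervised data, filter out low-quality pseudo-labels to handle domain mismatch, add the rest to the training pool) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model, scoring function, and threshold are all stand-in assumptions.

```python
# Hedged sketch of vanilla pseudo-labeling with confidence-based filtering.
# `model`, `score_fn`, and `threshold` are hypothetical placeholders, not
# the components used in the paper.

def pseudo_label(model, unlabeled, score_fn, threshold):
    """Label unsupervised examples; keep only confident pseudo-labels."""
    pool = []
    for x in unlabeled:
        y = model(x)  # hypothesized label (e.g. transcript/translation)
        # Filtering step: drop pseudo-labels whose score falls below the
        # threshold, a simple remedy for out-of-distribution inputs.
        if score_fn(x, y) >= threshold:
            pool.append((x, y))
    return pool

# Toy usage with stand-in functions (purely illustrative):
toy_model = lambda x: x.upper()
toy_score = lambda x, y: 1.0 if len(x) > 3 else 0.0
pool = pseudo_label(toy_model, ["hello", "hi"], toy_score, 0.5)
# Only the confidently scored example survives filtering.
```

In practice the score might come from model likelihood or an agreement heuristic; the filtered pool is then mixed with the supervised data for further training.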
