Paper Title
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
Paper Authors
Paper Abstract
Multimodal pre-training for audio-and-text has recently been shown to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data, which poses a challenge for the many languages that are rich in unimodal corpora but scarce in parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves; (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio), given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings); (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text) and the corresponding text embeddings (audio features) translated in the previous iteration into new, less noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model, which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves performance on multiple downstream speech understanding tasks comparable to that of a model pre-trained on fully parallel data, demonstrating the great potential of the proposed method. Our code is available at: \url{https://github.com/KarlYuKang/Low-Resource-Multimodal-Pre-training}.
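The central mechanism in the framework is denoising auto-encoding applied within and across modalities. The sketch below illustrates only the intra-modal case (IDAE) in PyTorch: a unimodal Transformer encoder is trained to reconstruct clean text (or audio) representations from a corrupted copy of themselves. All module names, the masking noise, and the hyperparameters are illustrative assumptions rather than the released implementation in the repository above; the cross-modal objectives (CDAE, IDP) would additionally condition on translated features from the other modality.

# Minimal, illustrative sketch of intra-modal denoising auto-encoding (IDAE).
# Names, noise scheme, and sizes are hypothetical, not the authors' code.
import torch
import torch.nn as nn

class UnimodalDenoisingEncoder(nn.Module):
    """Transformer encoder that reconstructs clean representations from noisy ones."""
    def __init__(self, dim=768, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.reconstruct = nn.Linear(dim, dim)  # project encoder states back to the input space

    def forward(self, noisy_embeddings):
        hidden = self.encoder(noisy_embeddings)
        return self.reconstruct(hidden)

def add_noise(embeddings, mask_prob=0.15):
    """Corrupt a sequence by zeroing out a random fraction of positions."""
    mask = torch.rand(embeddings.shape[:2], device=embeddings.device) < mask_prob
    return embeddings.masked_fill(mask.unsqueeze(-1), 0.0)

# One IDAE training step on text embeddings (the audio branch is symmetric).
text_emb = torch.randn(8, 32, 768)           # batch of clean text representations
model = UnimodalDenoisingEncoder()
recon = model(add_noise(text_emb))           # reconstruct from the noisy version
loss = nn.functional.mse_loss(recon, text_emb)
loss.backward()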