Paper Title

Joint Speech Recognition and Audio Captioning

Paper Authors

Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

Paper Abstract

Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of contents in audio samples. We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently. A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore we also create a multi-task dataset by mixing the clean speech Wall Street Journal corpus with multiple levels of background noises chosen from the AudioCaps dataset. We also perform extensive experimental evaluation and show improvements of our proposed methods as compared to existing state-of-the-art ASR and AAC methods.
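The multi-task dataset described above is built by mixing clean Wall Street Journal utterances with AudioCaps background clips at multiple noise levels. The paper does not include code here, so the following is only a minimal sketch of one common way to do such mixing at a target signal-to-noise ratio (SNR); the function `mix_at_snr` and all parameter values are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: mix a clean speech waveform with a noise clip at a target SNR.
# Names and constants are assumptions for illustration only.
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return speech + noise scaled so the mixture has roughly `snr_db` dB SNR."""
    # Loop or trim the noise so it covers the whole speech segment.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so the speech-to-noise power ratio matches the target SNR.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    mixture = speech + noise
    # Normalize only if the mixture would clip when saved as 16-bit audio.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak
    return mixture


if __name__ == "__main__":
    # Stand-ins for a WSJ utterance and an AudioCaps clip (random noise here).
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000) * 0.1
    noise = rng.standard_normal(8000) * 0.1
    # "Multiple levels of background noise" could correspond to several SNRs.
    for snr in (20, 10, 5):
        mixed = mix_at_snr(speech, noise, snr)
        print(f"SNR {snr} dB -> mixture RMS {np.sqrt(np.mean(mixed ** 2)):.4f}")
```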
