Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations

Paper title

Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations

Authors

Paul Primus, Gerhard Widmer

Abstract

The absence of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses pre-trained embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. We employ various data augmentation techniques on audio and text inputs and systematically tune their corresponding hyperparameters with sequential model-based optimization. Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance.
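
The retrieval step implied by a shared audio-caption space can be sketched as ranking recordings by their similarity to an embedded query. The snippet below is only an illustration under stated assumptions: the embeddings are invented three-dimensional toy vectors rather than outputs of the pre-trained models, cosine similarity is an assumed choice of closeness measure, and the names `cosine_similarity` and `rank_recordings` are hypothetical helpers, not part of the authors' system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_recordings(query_embedding, audio_embeddings):
    """Return recording names sorted from most to least similar to the query."""
    scores = {
        name: cosine_similarity(query_embedding, embedding)
        for name, embedding in audio_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings, invented for illustration only (assumption: a real system
# would obtain these from pre-trained audio and text embedding models).
audio_embeddings = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav": [0.1, 0.9, 0.2],
    "traffic.wav": [0.0, 0.3, 0.9],
}

# Stands in for the embedding of a text query such as "a dog is barking".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_recordings(query_embedding, audio_embeddings)
print(ranking)
```

With these toy vectors the recording whose embedding points in nearly the same direction as the query is ranked first; in a trained system, that closeness is what the embedding models are optimized to produce for matching audio-caption pairs.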
