Automated Audio Captioning and Language-Based Audio Retrieval

Paper Title

Automated Audio Captioning and Language-Based Audio Retrieval

Authors

Gomes, Clive; Park, Hyejin; Kollman, Patrick; Song, Yi; Houndayi, Iffanice; Shah, Ankit

Abstract

This project involved participation in the DCASE 2022 Competition (Task 6), which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The first subtask involved generating a textual description for an audio sample, while the goal of the second was to find audio samples within a fixed dataset that match a given description. For both subtasks, the Clotho dataset was used. The models were evaluated on BLEU1, BLEU2, BLEU3, ROUGE-L, METEOR, CIDEr, SPICE, and SPIDEr scores for audio captioning and on R1, R5, R10, and mARP10 scores for audio retrieval. We conducted a handful of experiments that modify the baseline models for these tasks. Our final architecture for Automated Audio Captioning is close to the baseline performance, while our model for Language-Based Audio Retrieval has surpassed its baseline counterpart.
