Paper Title
Local Information Assisted Attention-free Decoder for Audio Captioning
Paper Authors
Paper Abstract
Automated audio captioning aims to describe audio data with captions in natural language. Existing methods often employ an encoder-decoder structure, where an attention-based decoder (e.g., the Transformer decoder) is widely used and achieves state-of-the-art performance. Although this approach effectively captures global information within audio data via the self-attention mechanism, it may ignore events of short duration, due to its limitation in capturing local information in an audio signal, leading to inaccurate caption predictions. To address this issue, we propose a method using pretrained audio neural networks (PANNs) as the encoder and a local information assisted attention-free Transformer (LocalAFT) as the decoder. The novelty of our method lies in the proposed LocalAFT decoder, which allows local information within an audio signal to be captured while retaining global information. This enables events of different durations, including short ones, to be captured for more precise caption generation. Experiments show that our method outperforms the state-of-the-art methods in Task 6 of the DCASE 2021 Challenge that use the standard attention-based decoder for caption generation.
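To illustrate the idea behind an attention-free decoder with local information, the following is a minimal NumPy sketch of attention-free attention with a locally windowed position bias (in the style of AFT-local, on which attention-free Transformer decoders build). The function name, shapes, and `window` parameter are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def aft_local(Q, K, V, w, window):
    """Attention-free attention with a local position bias (illustrative).

    Q, K, V: (T, d) query/key/value arrays; w: (T, T) learned pairwise
    position biases. There is no Q @ K.T similarity matrix: each output
    step is a weighted average of values, with weights given by the keys
    plus a position bias that is zeroed outside a local window.
    """
    T, _ = Q.shape
    # Restrict the pairwise bias to a local window around each time step,
    # so nearby (local) positions are emphasized over distant ones.
    idx = np.arange(T)
    local = np.abs(idx[:, None] - idx[None, :]) < window
    w = np.where(local, w, 0.0)
    # Numerator / denominator of the per-step weighted average of V.
    num = np.exp(w) @ (np.exp(K) * V)   # (T, d)
    den = np.exp(w) @ np.exp(K)         # (T, d)
    # Sigmoid-gated queries modulate the aggregated values elementwise.
    gate = 1.0 / (1.0 + np.exp(-Q))
    return gate * num / den
```

Because the weights over time steps sum to one per feature, the aggregation is a convex combination of the value rows; the query acts only as an elementwise gate, which is what makes the operation attention-free while the windowed bias keeps it local.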