Paper Title
Learning Audio Representations with MLPs
Paper Authors
Paper Abstract
In this paper, we propose an efficient MLP-based approach for learning audio representations, namely timestamp and scene-level audio embeddings. We use an encoder consisting of sequentially stacked gated MLP blocks, which accept 2D MFCCs as inputs. We also provide a simple temporal-interpolation-based algorithm for computing scene-level embeddings from timestamp embeddings. The audio representations generated by our method are evaluated across a diverse set of benchmarks in the Holistic Evaluation of Audio Representations (HEAR) challenge, hosted at the NeurIPS 2021 competition track. We achieved first place on the Speech Commands (full), Speech Commands (5 hours), and Mridangam Tonic benchmarks. Furthermore, our approach is the most resource-efficient among all submitted methods, in terms of both the number of model parameters and the time required to compute embeddings.
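The abstract does not detail the internals of the gated MLP blocks, but the standard gMLP design wraps a spatial gating unit between two channel projections, with a residual connection. The following NumPy sketch of one block's forward pass is illustrative only: the shapes, weight names, and initialization are toy assumptions, not the paper's actual code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (channel) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_block(x, W_in, W_spatial, b_spatial, W_out):
    """One gated MLP block over x of shape (seq_len, d_model).

    Channel projection -> spatial gating unit (mixes along the time
    axis) -> channel projection back, plus a residual connection.
    """
    shortcut = x
    x = layer_norm(x)
    x = gelu(x @ W_in)                         # (seq_len, d_ff)
    u, v = np.split(x, 2, axis=-1)             # each (seq_len, d_ff // 2)
    v = W_spatial @ layer_norm(v) + b_spatial  # mix frames along time
    x = (u * v) @ W_out                        # (seq_len, d_model)
    return x + shortcut

# Toy example: 4 MFCC frames, 8-dim model, 16-dim hidden (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
x = rng.standard_normal((seq_len, d_model))
W_in = rng.standard_normal((d_model, d_ff)) * 0.1
W_spatial = rng.standard_normal((seq_len, seq_len)) * 0.1
b_spatial = np.ones((seq_len, 1))  # gate biased toward identity at init
W_out = rng.standard_normal((d_ff // 2, d_model)) * 0.1
y = gmlp_block(x, W_in, W_spatial, b_spatial, W_out)
print(y.shape)  # (4, 8)
```

Because the spatial projection `W_spatial` acts along the time axis, each output frame can attend to every input frame without self-attention, which is what keeps the parameter count and compute low.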
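The abstract mentions, without spelling out, a temporal-interpolation algorithm for deriving a single scene-level embedding from a sequence of timestamp embeddings. One plausible reading, sketched below under stated assumptions (the grid length, the per-dimension linear interpolation, and the final mean pooling are all guesses, not the paper's specification):

```python
import numpy as np

def scene_embedding(timestamp_emb, target_len=16):
    """Collapse timestamp embeddings (T, d) into one scene vector (d,).

    Illustrative sketch: linearly resample each embedding dimension onto a
    fixed grid of target_len time steps, then mean-pool over the grid. The
    paper's actual algorithm may differ in grid size and pooling.
    """
    T, d = timestamp_emb.shape
    src = np.linspace(0.0, 1.0, T)          # original frame positions
    dst = np.linspace(0.0, 1.0, target_len) # fixed-length target grid
    resampled = np.stack(
        [np.interp(dst, src, timestamp_emb[:, j]) for j in range(d)],
        axis=1,
    )                                       # (target_len, d)
    return resampled.mean(axis=0)

# Toy example: 7 timestamp embeddings of dimension 4.
emb = np.random.default_rng(1).standard_normal((7, 4))
scene = scene_embedding(emb)
print(scene.shape)  # (4,)
```

Resampling to a fixed grid before pooling makes the scene embedding insensitive to clip length, so clips of any duration map to vectors of the same dimensionality.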