Paper Title
The Efficacy of Self-Supervised Speech Models for Audio Representations
Paper Authors
Paper Abstract
Self-supervised learning (SSL) speech models, which can serve as powerful upstream models for extracting meaningful speech representations, have achieved unprecedented success in speech representation learning. However, their effectiveness on non-speech datasets is relatively underexplored. In this work, we propose an ensemble framework, with a combination of ensemble techniques, to fuse SSL speech models' embeddings. Extensive experiments on speech and non-speech audio datasets are conducted to investigate the representation ability of our ensemble method and of its individual constituent models. Ablation studies are carried out to evaluate the performance of different ensemble techniques, such as feature averaging and concatenation. All experiments were conducted during the NeurIPS 2021 HEAR Challenge, using the standard evaluation pipeline provided by the competition officials. The results demonstrate SSL speech models' strong abilities on various non-speech tasks, while we also note that they fail to handle fine-grained music tasks, such as pitch classification and note onset detection. In addition, feature ensembling shows great potential for producing more holistic representations, as our proposed framework generally surpasses state-of-the-art SSL speech/audio models and achieves superior performance on various datasets compared with other teams in the HEAR Challenge. Our code is available at https://github.com/tony10101105/HEAR-2021-NeurIPS-Challenge--NTU-GURA.
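As a rough illustration of the feature-ensemble idea described in the abstract (not the authors' exact implementation), the minimal PyTorch sketch below fuses frame-level embeddings from two hypothetical upstream SSL models, once by concatenation along the feature dimension and once by averaging after projecting to a common size. The model widths, projection size, and tensor shapes are placeholder assumptions.

```python
# Minimal sketch of embedding fusion (concatenation vs. averaging).
# Upstream models, dimensions, and shapes are placeholders, not the paper's exact setup.
import torch
import torch.nn as nn

batch, frames = 4, 200
dim_a, dim_b = 768, 1024          # e.g. two SSL upstreams with different widths

emb_a = torch.randn(batch, frames, dim_a)   # frame-level embeddings from upstream A
emb_b = torch.randn(batch, frames, dim_b)   # frame-level embeddings from upstream B

# 1) Concatenation: keeps every feature, downstream input grows to dim_a + dim_b.
concat_emb = torch.cat([emb_a, emb_b], dim=-1)          # shape (4, 200, 1792)

# 2) Averaging: project both to a shared size, then take the element-wise mean.
proj_a = nn.Linear(dim_a, 512)
proj_b = nn.Linear(dim_b, 512)
avg_emb = (proj_a(emb_a) + proj_b(emb_b)) / 2           # shape (4, 200, 512)

print(concat_emb.shape, avg_emb.shape)
```

Concatenation preserves each model's full representation at the cost of a wider downstream input, while averaging keeps the dimensionality fixed; the abstract's ablation studies compare trade-offs of this kind.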