Deep Embeddings for Robust User-Based Amateur Vocal Percussion Classification

Paper title

Deep Embeddings for Robust User-Based Amateur Vocal Percussion Classification

Authors

Alejandro Delgado, Emir Demirel, Vinod Subramanian, Charalampos Saitis, Mark Sandler

Abstract

Vocal Percussion Transcription (VPT) is concerned with the automatic detection and classification of vocal percussion sound events, allowing music creators and producers to sketch drum lines on the fly. Classifier algorithms in VPT systems learn best from small user-specific datasets, which usually restrict modelling to small input feature sets to avoid data overfitting. This study explores several deep supervised learning strategies to obtain informative feature sets for amateur vocal percussion classification. We evaluated the performance of these sets on regular vocal percussion classification tasks and compared them with several baseline approaches including feature selection methods and a speech recognition engine. These proposed learning models were supervised with several label sets containing information from four different levels of abstraction: instrument-level, syllable-level, phoneme-level, and boxeme-level. Results suggest that convolutional neural networks supervised with syllable-level annotations produced the most informative embeddings for classification, which can be used as input representations to fit classifiers with. Finally, we used back-propagation-based saliency maps to investigate the importance of different spectrogram regions for feature learning.
