Paper Title
Singing Beat Tracking With Self-supervised Front-end and Linear Transformers
Paper Authors
Paper Abstract
Tracking beats of singing voices without the presence of musical accompaniment can find many applications in music production, automatic song arrangement, and social media interaction. Its main challenge is the lack of strong rhythmic and harmonic patterns that are important for music rhythmic analysis in general. Even for human listeners, this can be a challenging task. As a result, existing music beat tracking systems fail to deliver satisfactory performance on singing voices. In this paper, we propose singing beat tracking as a novel task and present the first approach to solving it. Our approach leverages semantic information of singing voices by employing pre-trained self-supervised WavLM and DistilHuBERT speech representations as the front-end, and uses a self-attention encoder layer to predict beats. To train and test the system, we obtain separated singing voices and their beat annotations using source separation and beat tracking on complete songs, followed by manual corrections. Experiments on the 741 separated vocal tracks of the GTZAN dataset show that the proposed system outperforms several state-of-the-art music beat tracking methods by a large margin in terms of beat tracking accuracy. Ablation studies also confirm the advantages of pre-trained self-supervised speech representations over generic spectral features.
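The pipeline the abstract describes (pre-trained self-supervised features as a front-end, a self-attention encoder to produce per-frame beat activations, then peak picking) can be sketched in a minimal form. This is an illustrative sketch only, not the authors' implementation: the feature dimension, weights, and the 0.5 peak-picking threshold are hypothetical, and random arrays stand in for WavLM/DistilHuBERT features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the time (frame) axis.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, D = 200, 768                          # 200 frames of hypothetical 768-dim SSL features
X = rng.standard_normal((T, D)) * 0.1    # stand-in for WavLM-style frame features
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.01 for _ in range(3))
w_out = rng.standard_normal(D) * 0.01    # untrained linear beat-activation head

H = self_attention(X, Wq, Wk, Wv)
activation = 1 / (1 + np.exp(-(H @ w_out)))  # per-frame beat probability in [0, 1]

# Naive local-maximum peak picking; real systems often use a DBN post-processor.
beats = [t for t in range(1, T - 1)
         if activation[t] > activation[t - 1]
         and activation[t] >= activation[t + 1]
         and activation[t] > 0.5]
```

In a trained system the weights would be learned, the features would come from the frozen or fine-tuned speech model, and the frame indices in `beats` would be converted to timestamps using the front-end's frame rate.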