Paper Title
Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription
Paper Authors
Paper Abstract
Automatic speech recognition (ASR) has progressed significantly in recent years due to the emergence of large-scale datasets and the self-supervised learning (SSL) paradigm. However, as its counterpart problem in the singing domain, the development of automatic lyric transcription (ALT) suffers from limited data and degraded intelligibility of sung lyrics. To fill in the performance gap between ALT and ASR, we attempt to exploit the similarities between speech and singing. In this work, we propose a transfer-learning-based ALT solution that takes advantage of these similarities by adapting wav2vec 2.0, an SSL ASR model, to the singing domain. We maximize the effectiveness of transfer learning by exploring the influence of different transfer starting points. We further enhance the performance by extending the original CTC model to a hybrid CTC/attention model. Our method surpasses previous approaches by a large margin on various ALT benchmark datasets. Further experiments show that, with even a tiny proportion of training data, our method still achieves competitive performance.
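To make the transfer-learning setup described above concrete, the following is a minimal sketch (not the authors' released code) of the first stage: start from a wav2vec 2.0 checkpoint already fine-tuned for speech recognition (one possible "transfer starting point") and continue fine-tuning its CTC head on sung-lyric data. The checkpoint name, the placeholder waveform, and the example transcript are illustrative assumptions; the hybrid CTC/attention extension is not shown.

# Sketch of continuing CTC fine-tuning of a speech wav2vec 2.0 model on singing data.
# Checkpoint and data below are placeholders, not the paper's exact configuration.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # common practice: keep the convolutional feature extractor fixed

# One illustrative training step on a (waveform, lyric) pair sampled at 16 kHz.
waveform = torch.randn(16000 * 5)             # placeholder 5-second singing clip
lyrics = "AND I WILL ALWAYS LOVE YOU"         # placeholder target lyric transcript

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(lyrics, return_tensors="pt").input_ids

outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()                       # CTC loss on the sung lyrics; an optimizer step would follow

In practice the singing data would be drawn from an ALT corpus such as DSing, and the abstract's comparison of "transfer starting points" corresponds to swapping the pretrained checkpoint loaded above (e.g. a purely self-supervised model versus one already fine-tuned on speech).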