Paper Title

On the Role of Visual Context in Enriching Music Representations

Authors

Kleanthis Avramidis, Shanti Stewart, Shrikanth Narayanan

Abstract

Human perception and experience of music is highly context-dependent. Contextual variability contributes to differences in how we interpret and interact with music, challenging the design of robust models for information retrieval. Incorporating multimodal context from diverse sources provides a promising approach toward modeling this variability. Music presented in media such as movies and music videos provides rich multimodal context that modulates underlying human experiences. However, such context modeling is underexplored, as it requires large amounts of multimodal data along with relevant annotations. Self-supervised learning can help address these challenges by automatically extracting rich, high-level correspondences between different modalities, hence alleviating the need for fine-grained annotations at scale. In this study, we propose VCMR -- Video-Conditioned Music Representations, a contrastive learning framework that learns music representations from audio and the accompanying music videos. The contextual visual information enhances representations of music audio, as evaluated on the downstream task of music tagging. Experimental results show that the proposed framework can contribute additive robustness to audio representations and indicate to what extent musical elements are affected or determined by visual context.
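The abstract describes a cross-modal contrastive objective that pulls paired audio and video embeddings together while pushing apart mismatched pairs. The sketch below shows the standard symmetric InfoNCE loss commonly used for such audio-visual alignment; the function names, the temperature value, and the use of NumPy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp, used for the softmax denominator.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, video_emb: (N, D) arrays; row i of each matrix comes from
    the same audio/video clip, so the diagonal holds the positive pairs.
    (Illustrative sketch -- not the paper's exact objective.)
    """
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (N, N) similarity matrix

    # Cross-entropy with the matching pair (diagonal) as the target,
    # averaged over both retrieval directions.
    logp_a2v = logits - logsumexp(logits, axis=1)
    logp_v2a = logits.T - logsumexp(logits.T, axis=1)
    loss_a2v = -np.mean(np.diag(logp_a2v))
    loss_v2a = -np.mean(np.diag(logp_v2a))
    return (loss_a2v + loss_v2a) / 2
```

With perfectly aligned embeddings the loss approaches zero, while shuffling the pairing drives it up, which is the signal that lets the model learn audio-video correspondences without manual labels.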
