Paper Title

AudioLM: a Language Modeling Approach to Audio Generation

Paper Authors

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

Paper Abstract

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
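
To make the hybrid tokenization concrete, below is a minimal Python sketch of the data flow the abstract describes: one coarse token stream capturing long-term structure and one fine token stream capturing acoustic detail, both obtained by nearest-neighbor quantization against a codebook. The paper derives these from w2v-BERT activations and SoundStream codec codes respectively; here the feature extractors and codebooks (`quantize`, `semantic_codebook`, `acoustic_codebook`, the random projections) are hypothetical stand-ins that only illustrate the shapes and the two-rate structure, not the actual models.

```python
# Minimal sketch of AudioLM-style hybrid tokenization (assumption: random
# codebooks and random projections stand in for the pre-trained masked LM
# and the neural audio codec).
import numpy as np

rng = np.random.default_rng(0)

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature frame to the index of its nearest codebook entry."""
    # (frames, 1, dim) - (1, codes, dim) -> (frames, codes) distances
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)

# Hypothetical codebooks: a small one for coarse semantic structure,
# a larger one for fine acoustic detail.
semantic_codebook = rng.normal(size=(1024, 16))
acoustic_codebook = rng.normal(size=(4096, 16))

waveform = rng.normal(size=16000)  # 1 s of placeholder audio at 16 kHz

# Semantic tokens: low frame rate (320 samples per frame here),
# intended to capture long-term structure.
semantic_frames = waveform[: len(waveform) // 320 * 320].reshape(-1, 320)
semantic_tokens = quantize(semantic_frames @ rng.normal(size=(320, 16)),
                           semantic_codebook)

# Acoustic tokens: higher frame rate (160 samples per frame here),
# intended to capture the detail needed for high-quality synthesis.
acoustic_frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
acoustic_tokens = quantize(acoustic_frames @ rng.normal(size=(160, 16)),
                           acoustic_codebook)

# AudioLM then trains autoregressive models over these discrete sequences
# (semantic tokens first, acoustic tokens conditioned on them) and decodes
# generated acoustic tokens back to a waveform with the codec.
print(semantic_tokens.shape, acoustic_tokens.shape)  # (50,) (100,)
```

The two streams trade off differently, which is the point of the hybrid scheme: a low-rate semantic stream keeps sequences short enough for a language model to learn long-range structure, while the higher-rate acoustic stream preserves the detail a codec needs to reconstruct high-quality audio.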
