Paper Title


A Language Model With Million Context Length For Raw Audio

Paper Authors

Verma, Prateek

Paper Abstract

Modeling long-term dependencies in audio signals is a particularly challenging problem, as even small time scales yield on the order of a hundred thousand samples. With the recent advent of Transformers, neural architectures have become adept at modeling dependencies over longer time scales, but they suffer from the quadratic cost of attention when scaled up. We propose a generative auto-regressive architecture that can model audio waveforms over a very large context of more than 500,000 samples. Our approach learns time dependencies by first learning a latent representation with a CNN front-end, and then modeling dependencies over these representations with Transformer encoders, trained fully end-to-end, thereby allowing the model to learn whatever representations it deems fit for predicting the next sample. Unlike previous works that compared different time scales to show improvement, we use a standard dataset, with the same number of parameters and the same context length, to demonstrate improvements. We achieve state-of-the-art performance compared to other approaches such as WaveNet, SaShiMi, and SampleRNN on a standard dataset for modeling long-term structure. This work gives a very exciting direction for the field: given the improvements in context modeling, it can be scaled with more data, with potentially better results from using billions or trillions of parameters.
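The pipeline the abstract describes (a strided CNN front-end that compresses the raw waveform into latent frames, followed by causal self-attention over those frames) can be sketched minimally as below. This is a hypothetical illustration, not the authors' code: the kernel size, stride, and latent dimension are assumed values, the projections are untrained, and only a single attention head is shown. The point is that attention cost becomes quadratic in the number of latent frames rather than in the number of raw samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_frontend(wave, kernel, stride):
    """Strided 1-D convolution: turns a long waveform into fewer latent frames."""
    k, d = kernel.shape  # kernel: (kernel_size, latent_dim)
    n_frames = (len(wave) - k) // stride + 1
    frames = np.stack([wave[i * stride : i * stride + k] for i in range(n_frames)])
    return frames @ kernel  # (n_frames, latent_dim)

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (autoregressive)."""
    d = x.shape[-1]
    q, k, v = x, x, x  # untrained: identity projections, for brevity
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), 1)
    scores[mask] = -np.inf  # each frame attends only to itself and the past
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

wave = rng.standard_normal(4096)               # raw audio samples
kernel = rng.standard_normal((64, 16)) * 0.1   # assumed: kernel size 64, latent dim 16
latents = conv1d_frontend(wave, kernel, stride=64)  # (64, 16): 64x shorter sequence
out = causal_self_attention(latents)

# Attention is computed over 64 latent frames instead of 4096 raw samples.
print(latents.shape, out.shape)
```

In the actual architecture the front-end and Transformer are trained jointly end-to-end, so the latent frames are shaped by the next-sample prediction objective rather than fixed random kernels as in this sketch.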
