Paper title
Practical cognitive speech compression
Paper authors
Paper abstract
This paper presents a new neural speech compression method that is practical in the sense that it operates at low bitrate, introduces low latency, has a computational complexity compatible with current mobile devices, and provides a subjective quality comparable to that of standard mobile-telephony codecs. Other recently proposed neural vocoders can also operate at low bitrate, but they do not reach the subjective quality of standard codecs. Standard codecs, on the other hand, rely on objective, short-term metrics such as the segmental signal-to-noise ratio, which correlate only weakly with perception. Furthermore, standard codecs are less efficient than unsupervised neural networks at capturing speech attributes, especially long-term ones. The proposed method combines a cognitive-coding encoder, which extracts an interpretable unsupervised hierarchical representation, with a multi-stage decoder that has a GAN-based architecture. We observe that this method is very robust to the quantization of the representation features. An AB test was conducted on a subset of the Harvard sentences, which are commonly used to evaluate standard mobile-telephony codecs. The results show that the proposed method outperforms the standard AMR-WB codec in terms of delay, bitrate, and subjective quality.
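To make concrete the kind of objective, short-term metric the abstract refers to, the sketch below computes a segmental signal-to-noise ratio: the SNR in dB is evaluated on short non-overlapping frames and then averaged. It is an illustration only and is not taken from the paper. The function name `segmental_snr`, the 320-sample frame length (20 ms at an assumed 16 kHz sampling rate), and the clamping range of -10 to 35 dB are assumptions chosen for this example.

```python
import math


def segmental_snr(reference, degraded, frame_len=320,
                  min_db=-10.0, max_db=35.0, eps=1e-10):
    """Average per-frame SNR in dB over full, non-overlapping frames.

    frame_len, min_db, max_db and eps are illustrative assumptions,
    not values taken from the paper.
    """
    if len(reference) != len(degraded):
        raise ValueError("signals must have the same length")

    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, deg))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        # Clamp each frame so silent or perfectly coded frames do not
        # dominate the average.
        frame_snrs.append(min(max(snr_db, min_db), max_db))

    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


# Usage: a 200 Hz tone sampled at 16 kHz, and a copy attenuated by 10 %.
tone = [math.sin(2.0 * math.pi * 200.0 * n / 16000.0) for n in range(640)]
attenuated = [0.9 * x for x in tone]
score = segmental_snr(tone, attenuated)
```

Because the score depends only on the sample-by-sample error within each frame, a decoded signal that sounds natural but is not waveform-aligned with the input can score poorly. This is one way to read the abstract's remark that such metrics correlate only weakly with perception.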