Paper Title
DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis
Paper Authors
Paper Abstract
Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on the speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. Speaker recognition performance improves further when DeepTalk is combined with a state-of-the-art speaker recognition system based on physiological speech features. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that DeepTalk captures the F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is nearly indistinguishable from real speech in the context of speaker recognition.
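The abstract notes that F0 (fundamental-frequency) contours are central to vocal style modeling. As a minimal illustration of what an F0 contour is — this is a classic autocorrelation-based pitch estimator, not the paper's DeepTalk network, and the function names and parameters below are my own — one can frame a waveform and estimate F0 per frame:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    A standard baseline pitch tracker (not the paper's method): the
    autocorrelation of a voiced frame peaks at the pitch period, so the
    strongest lag in the plausible pitch range gives F0 = sr / lag.
    """
    frame = frame - frame.mean()
    # Keep the non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)          # shortest period to consider
    lag_max = int(sr / fmin)          # longest period to consider
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# Sketch of an F0 contour on a synthetic 220 Hz tone.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220.0 * t)
frame_len, hop = 1024, 512
f0_contour = [estimate_f0(y[i:i + frame_len], sr)
              for i in range(0, len(y) - frame_len, hop)]
```

On this synthetic tone every frame's estimate lands near 220 Hz; on real speech the resulting sequence of per-frame values traces the speaker's intonation, which is the kind of prosodic information the abstract says DeepTalk encodes.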