Paper Title
DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis
Paper Authors
Paper Abstract
Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on the speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. Speaker recognition performance improves further when DeepTalk is combined with a state-of-the-art speaker recognition system based on physiological speech features. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that DeepTalk captures the F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is nearly indistinguishable from real speech in the context of speaker recognition.
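The abstract notes that F0 (fundamental-frequency) contours are central to vocal style modeling. As a minimal illustration of what an F0 contour is — this is a classic autocorrelation-based pitch estimator, not the paper's DeepTalk network, and the function names and parameters below are my own — one can frame a waveform and estimate F0 per frame:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    A standard baseline pitch tracker (not the paper's method): the
    autocorrelation of a voiced frame peaks at the pitch period, so the
    strongest lag in the plausible pitch range gives F0 = sr / lag.
    """
    frame = frame - frame.mean()
    # Keep the non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)          # shortest period to consider
    lag_max = int(sr / fmin)          # longest period to consider
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# Sketch of an F0 contour on a synthetic 220 Hz tone.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220.0 * t)
frame_len, hop = 1024, 512
f0_contour = [estimate_f0(y[i:i + frame_len], sr)
              for i in range(0, len(y) - frame_len, hop)]
```

On this synthetic tone every frame's estimate lands near 220 Hz; on real speech the resulting sequence of per-frame values traces the speaker's intonation, which is the kind of prosodic information the abstract says DeepTalk encodes.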