Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)

Paper title

Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)

Authors

Ariadna Sanchez, Alessio Falai, Ziyao Zhang, Orazio Angelini, Kayoko Yanagisawa

Abstract

An essential design decision for multilingual Neural Text-To-Speech (NTTS) systems is how to represent input linguistic features within the model. Looking at the wide variety of approaches in the literature, two main paradigms emerge, unified and separate representations. The former uses a shared set of phonetic tokens across languages, whereas the latter uses unique phonetic tokens for each language. In this paper, we conduct a comprehensive study comparing multilingual NTTS systems models trained with both representations. Our results reveal that the unified approach consistently achieves better cross-lingual synthesis with respect to both naturalness and accent. Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity. For this reason, we carry out an ablation study to understand the interaction of the representation type with the size of the token embedding. We find that the difference between the two paradigms only emerges above a certain threshold embedding size. This study provides strong evidence that unified representations should be the preferred paradigm when building multilingual NTTS systems.
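The distinction between the two paradigms can be illustrated with a minimal sketch. It assumes that tokens are plain phoneme symbols and uses made-up toy inventories for three languages; the names `INVENTORIES`, `build_unified_vocab`, and `build_separate_vocab` are hypothetical and are not taken from the paper, whose actual feature sets and models are not described in the abstract.

```python
# Toy phoneme inventories: illustrative assumptions, not data from the paper.
INVENTORIES = {
    "en": ["p", "t", "k", "s", "a", "i"],
    "es": ["p", "t", "k", "s", "a", "e"],
    "it": ["p", "t", "k", "ts", "a", "e"],
}


def build_unified_vocab(inventories):
    """Unified: one token per phoneme symbol, shared by every language."""
    symbols = sorted({ph for phones in inventories.values() for ph in phones})
    return {ph: idx for idx, ph in enumerate(symbols)}


def build_separate_vocab(inventories):
    """Separate: one token per (language, phoneme) pair, nothing is shared."""
    pairs = sorted(
        (lang, ph) for lang, phones in inventories.items() for ph in phones
    )
    return {pair: idx for idx, pair in enumerate(pairs)}


unified = build_unified_vocab(INVENTORIES)
separate = build_separate_vocab(INVENTORIES)


def encode_unified(lang, phones):
    # The language is ignored: the same symbol maps to the same token ID.
    return [unified[ph] for ph in phones]


def encode_separate(lang, phones):
    # The language is part of the key: the same symbol gets a different ID.
    return [separate[(lang, ph)] for ph in phones]


print(len(unified), len(separate))
print(encode_unified("en", ["p", "a"]), encode_unified("es", ["p", "a"]))
print(encode_separate("en", ["p", "a"]), encode_separate("es", ["p", "a"]))
```

In this toy setup the separate vocabulary grows with the number of languages, while the unified one grows only with the number of distinct symbols. With realistic inventories and many languages the gap widens, which is consistent with the abstract's remark that separate representations tend to have far more tokens and is why the embedding size is examined in the ablation.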
