Paper title

Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)

Authors

Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko Yanagisawa

Abstract

Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual corpora has emerged as a popular way of building voice-cloning-based Polyglot NTTS systems. To train these models, it is essential to understand how the composition of the training corpora affects the quality of multilingual speech synthesis. In this context, it is common to hear questions such as "Would including more Spanish data help my Italian synthesis, given the closeness of both languages?". Unfortunately, we found the existing literature on this topic to be incomplete. In the present work, we conduct an extensive ablation study aimed at understanding how various factors of the training corpora, such as language family affiliation, gender composition, and the number of speakers, contribute to the quality of Polyglot synthesis. Our findings include the observation that female speaker data are preferred in most scenarios, and that it is not always beneficial to have more speakers from the target language variant in the training corpus. These findings can inform data procurement and corpus building.
