Paper Title
Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems
Paper Authors
Paper Abstract
This paper proposes an emotional text-to-speech (TTS) system with an emotion prediction method based on a pre-trained language model (LM). Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system estimates emotion-related attributes directly from the input text. Specifically, we use generative pre-trained transformer 3 (GPT-3) to jointly predict an emotion class and its strength, which represent the coarse and fine properties of the emotion, respectively. These attributes are then combined in the emotional embedding space and used as conditional features of the TTS model that generates the output speech signal. Consequently, the proposed system can produce emotional speech from text alone, without any auxiliary inputs. Furthermore, because GPT-3 can capture the emotional context across consecutive sentences, the proposed method can effectively handle paragraph-level generation of emotional speech.
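The abstract does not say how the predicted emotion class and strength are combined in the embedding space. As a rough illustration only, the sketch below assumes one plausible scheme: linear interpolation between a neutral embedding and the class embedding, weighted by a strength in [0, 1]. The emotion labels, the toy 3-dimensional vectors, and the names `EMOTION_EMBEDDINGS`, `combine_emotion_attributes`, and `condition_paragraph` are hypothetical and not taken from the paper. In the described system, the (class, strength) pairs would come from the language model, and the embeddings would be learned.

```python
from typing import Dict, List, Tuple

# Hypothetical toy embeddings; a real system would learn these vectors.
EMOTION_EMBEDDINGS: Dict[str, List[float]] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, 0.0, 0.5],
}


def combine_emotion_attributes(emotion_class: str, strength: float) -> List[float]:
    """Blend the neutral and class embeddings according to strength.

    Assumption: strength 0.0 yields the neutral embedding and 1.0 yields
    the full class embedding. The paper may combine them differently.
    """
    if emotion_class not in EMOTION_EMBEDDINGS:
        raise KeyError(f"unknown emotion class: {emotion_class}")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    neutral = EMOTION_EMBEDDINGS["neutral"]
    target = EMOTION_EMBEDDINGS[emotion_class]
    return [n + strength * (t - n) for n, t in zip(neutral, target)]


def condition_paragraph(predictions: List[Tuple[str, float]]) -> List[List[float]]:
    """Map per-sentence (class, strength) predictions to conditioning vectors."""
    return [combine_emotion_attributes(c, s) for c, s in predictions]


if __name__ == "__main__":
    # Stand-in for language-model output on a three-sentence paragraph.
    paragraph_predictions = [("neutral", 0.0), ("happy", 0.5), ("sad", 1.0)]
    for vector in condition_paragraph(paragraph_predictions):
        print(vector)
```

Each resulting vector would then be passed to the TTS model as a conditional feature for the corresponding sentence.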