Paper Title
RedPen: Region- and Reason-Annotated Dataset of Unnatural Speech
Paper Authors
Paper Abstract
Even with recent advances in speech synthesis models, the evaluation of such models still relies purely on human judgement expressed as a single naturalness score, such as the Mean Opinion Score (MOS). A score-based metric gives no information about which parts of the speech are unnatural or why human judges consider them unnatural. We present a novel speech dataset, RedPen, with human annotations of unnatural speech regions and their corresponding reasons. RedPen consists of 180 synthesized speech samples with unnatural regions annotated by crowd workers; these regions are then assigned reasons and categorized by error type, such as voice trembling and background noise. We find that our dataset explains unnatural speech regions better than model-driven unnaturalness prediction. Our analysis also shows that each model exhibits different error types. In summary, our dataset shows that a variety of error regions and types can lie beneath a single naturalness score. We believe that our dataset will shed light on the evaluation and development of more interpretable speech models in the future. Our dataset will be made publicly available upon acceptance.