Paper Title
Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis
Authors
Abstract
Synthesized speech is common today due to the prevalence of virtual assistants, easy-to-use tools for generating and modifying speech signals, and remote work practices. It can also be used for nefarious purposes, such as creating a purported speech signal and attributing it to someone who never spoke its content. We need methods to detect whether a speech signal is synthesized. In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer (CCT) for synthesized speech detection. A CCT uses a convolutional layer to introduce inductive biases and shared weights into the network, allowing a transformer architecture to perform well with fewer training samples. The CCT uses an attention mechanism to incorporate information from all parts of the signal under analysis. We demonstrate that our CCT approach, trained on both genuine and synthesized human voice signals, successfully differentiates between genuine and synthesized speech signals.
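The two mechanisms named in the abstract, convolutional tokenization with shared weights and attention across all parts of the signal, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (conv_tokenize, self_attention, rand_matrix) are hypothetical, the weights are random and untrained, and a toy 8x8 array stands in for a real spectrogram. It also omits the parts of a full CCT that the abstract does not describe, such as pooling, positional embeddings, multiple heads, feed-forward layers, and the final classifier.

```python
import math
import random

def conv_tokenize(spec, kernels, stride):
    """Slide each shared kernel over the 2-D input; every position yields one token."""
    k = len(kernels[0])  # side length of the square kernels
    tokens = []
    for r in range(0, len(spec) - k + 1, stride):
        for c in range(0, len(spec[0]) - k + 1, stride):
            token = []
            for kern in kernels:
                acc = sum(kern[i][j] * spec[r + i][c + j]
                          for i in range(k) for j in range(k))
                token.append(max(0.0, acc))  # ReLU activation
            tokens.append(token)
    return tokens

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention: each token attends to all tokens."""
    q, k, v = (matmul(tokens, w) for w in (wq, wk, wv))
    scale = math.sqrt(len(q[0]))
    weights = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights.append(softmax(scores))
    return matmul(weights, v), weights

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
# Assumption: a toy 8x8 array of random values stands in for a spectrogram.
spec = [[rng.random() for _ in range(8)] for _ in range(8)]
kernels = [rand_matrix(3, 3, rng) for _ in range(4)]  # 4 shared 3x3 filters
tokens = conv_tokenize(spec, kernels, stride=2)       # 3x3 positions, 4 channels
dim = len(tokens[0])
wq, wk, wv = (rand_matrix(dim, dim, rng) for _ in range(3))
attended, weights = self_attention(tokens, wq, wk, wv)
print(len(tokens), dim)  # prints "9 4"
```

Because the same four kernels are reused at every position, the tokenizer has far fewer parameters than a per-patch linear projection would, which is the weight sharing the abstract credits for good performance with fewer training samples. Each row of weights sums to 1 and spans all nine tokens, so every output token mixes information from the whole input.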