Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis

Paper Title

Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis

Authors

Bartusiak, Emily R.; Delp, Edward J.

Abstract

Synthesized speech is common today due to the prevalence of virtual assistants, easy-to-use tools for generating and modifying speech signals, and remote work practices. Synthesized speech can also be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal. We need methods to detect if a speech signal is synthesized. In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer (CCT) for synthesized speech detection. A CCT utilizes a convolutional layer that introduces inductive biases and shared weights into a network, allowing a transformer architecture to perform well with fewer data samples used for training. The CCT uses an attention mechanism to incorporate information from all parts of a signal under analysis. Trained on both genuine human voice signals and synthesized human voice signals, we demonstrate that our CCT approach successfully differentiates between genuine and synthesized speech signals.
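The two ingredients the abstract names, a spectrogram representation and an attention mechanism that mixes information from all parts of the signal, can be illustrated with a minimal sketch. This is not the authors' implementation: the frame length, hop size, Hann window, and toy test tone are assumptions chosen for illustration, and the functions `spectrogram` and `self_attention` are hypothetical names. The sketch also omits the learned convolutional tokenizer, the trained query/key/value projections, and the classifier head that an actual CCT would include.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT, O(frame_len^2): fine for a sketch, use an FFT in practice.
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame)))
            for k in range(n_bins)
        ])
    return frames


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections
    (assumption: a trained model would apply learned Q/K/V matrices first)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output token is a weighted mix of all tokens in the signal.
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


# Toy input (hypothetical values): a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)        # frames x frequency bins
attended = self_attention(spec)  # same shape, each frame now sees every frame
```

Here each spectrogram frame is treated directly as a token. In a CCT as the abstract describes it, a convolutional layer would produce the tokens instead, which is what introduces the inductive biases and shared weights.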
