Paper Title
Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization
Paper Authors
Paper Abstract
We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing a cross-entropy loss for the individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
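To make the loss design concrete, below is a minimal PyTorch-style sketch of a chunk-wise contrastive loss and MDS aggregation of the kind the abstract describes. The margin value, tensor shapes, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_feats, video_feats, is_real, margin=0.99):
    """Contrastive loss over per-chunk audio/visual embeddings.

    Pulls the two modalities together for real videos and pushes them
    apart (up to `margin`) for fakes. `margin=0.99` is an assumed value.

    audio_feats, video_feats: (num_chunks, feat_dim) embeddings.
    is_real: (num_chunks,) float tensor, 1.0 for real, 0.0 for fake.
    """
    d = F.pairwise_distance(audio_feats, video_feats)  # Euclidean distance per chunk
    loss_real = is_real * d.pow(2)                     # real: penalize dissimilarity
    loss_fake = (1.0 - is_real) * F.relu(margin - d).pow(2)  # fake: enforce separation
    return (loss_real + loss_fake).mean()

def modality_dissonance_score(audio_feats, video_feats):
    """MDS: aggregate audio-visual dissimilarity over a video's chunks.

    A video would be flagged fake when its MDS exceeds a threshold
    tuned on a validation set; per-chunk distances also point to the
    manipulated segments for temporal localization.
    """
    return F.pairwise_distance(audio_feats, video_feats).sum().item()

if __name__ == "__main__":
    # Toy example: 10 one-second chunks with 128-dim embeddings.
    a = torch.randn(10, 128)
    v = torch.randn(10, 128)
    labels = torch.ones(10)  # 1.0 = real video
    print(contrastive_loss(a, v, labels))
    print(modality_dissonance_score(a, v))
```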