Paper Title
A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning
Authors
Abstract
Audio-only wake word spotting (WWS) is challenging under noisy conditions because environmental interference corrupts the transmitted signal. In this paper, we investigate the design of a compact audio-visual WWS system that uses visual information to alleviate this degradation. Specifically, we first encode the detected lip regions into fixed-size vectors with MobileNet, concatenate them with the acoustic features, and feed the result to a fusion network for WWS. However, a neural-network-based audio-visual model has a large footprint and high computational complexity. To meet application requirements, we introduce a neural network pruning strategy based on the lottery ticket hypothesis applied in an iterative fine-tuning manner (LTH-IF), and apply it to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) systems under different noisy conditions. Moreover, LTH-IF pruning largely reduces the network parameters and computations with no degradation in WWS performance, leading to a potential product solution for the TV wake-up scenario.
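The abstract does not spell out the LTH-IF procedure, so the following is only a minimal sketch of the general idea of iterative magnitude pruning with fine-tuning between rounds, not the authors' implementation. It assumes a single weight matrix, a fixed per-round pruning fraction, and a hypothetical `toy_fine_tune` callback that stands in for real training. All function names and parameter values here are illustrative assumptions.

```python
import numpy as np


def prune_step(weights, mask, prune_frac):
    """Deactivate the smallest-magnitude `prune_frac` of the still-active weights."""
    active = np.flatnonzero(mask)              # flat indices of surviving weights
    n_prune = int(len(active) * prune_frac)
    if n_prune == 0:
        return mask
    magnitudes = np.abs(weights.ravel()[active])
    drop = active[np.argsort(magnitudes)[:n_prune]]
    new_mask = mask.copy().ravel()
    new_mask[drop] = False
    return new_mask.reshape(mask.shape)


def iterative_prune(weights, rounds, prune_frac, fine_tune):
    """Alternate pruning and fine-tuning; pruned weights stay at zero."""
    mask = np.ones(weights.shape, dtype=bool)
    for _ in range(rounds):
        mask = prune_step(weights, mask, prune_frac)
        # Assumption: fine-tuning continues from the current weights
        # (no rewind to the initial values) and the mask is re-applied after.
        weights = fine_tune(weights * mask, mask) * mask
    return weights, mask


def toy_fine_tune(weights, mask):
    """Hypothetical placeholder for a real training loop on the masked model."""
    return weights * 1.01


rng = np.random.default_rng(0)
w0 = rng.normal(size=(20, 50))                 # 1000 illustrative weights
w, m = iterative_prune(w0, rounds=3, prune_frac=0.2, fine_tune=toy_fine_tune)
print(f"kept {int(m.sum())} of {w.size} weights")
```

Each round removes a fraction of the weights that survived the previous round, so the remaining fraction after `rounds` rounds is roughly `(1 - prune_frac) ** rounds`. In a real system the callback would retrain the masked audio, video, or fusion network on the WWS task before the next pruning round.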