Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Paper Title

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Authors

Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw

Abstract

Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrated the state-of-the-art performance of our proposed method.
