FRILL: A Non-Semantic Speech Embedding for Mobile Devices

Paper title

FRILL: A Non-Semantic Speech Embedding for Mobile Devices

Authors

Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak Patel

Abstract

Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight non-semantic speech embedding models that run efficiently on mobile devices based on the recently proposed TRILL speech embedding. We combine novel architectural modifications with existing speed-up techniques to create embedding models that are fast enough to run in real-time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32x faster on a Pixel 1 smartphone and 40% the size of TRILL, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sounds detection and face-masked speech detection. Our models and code are publicly available.
