Paper Title
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
Paper Authors
Paper Abstract
Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, pre-trained models such as HuBERT are storage-intensive Transformers, which limits their range of applications in low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find desired architectures automatically by pruning structured parameters. More precisely, we create a Transformer-based supernet nested with thousands of weight-sharing subnets and design a two-stage distillation strategy to leverage the contextualized latent representations of HuBERT. Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show that the proposed LightHuBERT enables over $10^9$ architectures spanning the embedding dimension, attention dimension, head number, feed-forward network ratio, and network depth. At a size comparable to HuBERT, LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks; it achieves performance comparable to the teacher model on most tasks with 29% fewer parameters, and obtains a $3.5\times$ compression ratio on three SUPERB tasks, i.e., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss. The code and pre-trained models are available at https://github.com/mechanicalsea/lighthubert.
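To make the scale of the search space concrete, the sketch below samples one subnet configuration over the dimensions the abstract names (embedding dimension, head number, feed-forward ratio, network depth) and counts the resulting architectures. It is a minimal illustration: the choice sets are assumptions for demonstration, not the paper's exact values, and the snippet does not use the released lighthubert package.

import random

# A minimal sketch of a LightHuBERT-style once-for-all search space.
# NOTE: the choice sets below are illustrative assumptions, not the
# exact values used in the paper.
SEARCH_SPACE = {
    "embed_dim": [512, 640, 768],   # subnet embedding dimension
    "num_heads": [8, 10, 12],       # attention heads, chosen per layer
    "ffn_ratio": [3.0, 3.5, 4.0],   # FFN hidden size / embedding dim, per layer
    "depth": [10, 11, 12],          # number of Transformer layers
}

def sample_subnet(space=SEARCH_SPACE):
    """Sample one weight-sharing subnet configuration from the supernet."""
    depth = random.choice(space["depth"])
    return {
        "embed_dim": random.choice(space["embed_dim"]),
        "depth": depth,
        # per-layer choices multiply combinatorially with depth
        "num_heads": [random.choice(space["num_heads"]) for _ in range(depth)],
        "ffn_ratio": [random.choice(space["ffn_ratio"]) for _ in range(depth)],
    }

def count_architectures(space=SEARCH_SPACE):
    """Count distinct subnets when heads and FFN ratio vary per layer."""
    per_layer = len(space["num_heads"]) * len(space["ffn_ratio"])
    return sum(len(space["embed_dim"]) * per_layer ** d for d in space["depth"])

if __name__ == "__main__":
    print(sample_subnet())
    print(f"~{count_architectures():.1e} candidate architectures")  # well over 10^9

Per-layer variation is what drives the combinatorial growth: with these assumed sets, nine (heads, FFN ratio) combinations per layer raised to a depth of 12 already yield over $10^{11}$ subnets, consistent with the "over $10^9$ architectures" claim in the abstract.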