Paper Title
Modular Hybrid Autoregressive Transducer
Authors
Abstract
Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM), or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict the label and blank distributions, respectively, along with a shared acoustic encoder. The encoder and label decoder outputs are directly projected to AM and internal LM scores, which are then added to compute label posteriors. We train MHAT with an internal LM loss in addition to the HAT loss to ensure that its internal LM becomes a standalone neural LM that can be effectively adapted to text. Moreover, text adaptation of MHAT enables much better LM fusion than internal LM subtraction-based methods. On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM fusion, compared to a HAT trained on 400K hours of speech.
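To make the factorization concrete, below is a minimal PyTorch sketch of how the MHAT score combination described in the abstract could look. All module names (`am_proj`, `ilm_proj`, `blank_proj`), dimensions, and the loss helper are assumptions for illustration, not the paper's implementation; the encoder, decoders, and RNN-T/HAT lattice plumbing are omitted. The label path adds projected AM and internal-LM scores before a softmax, while a separate blank decoder drives a HAT-style sigmoid blank probability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MHATJoint(nn.Module):
    """Sketch of the MHAT score combination (names and dims are assumptions).

    Label path: encoder and label-decoder outputs are projected to AM and
    internal-LM scores over the vocabulary and summed; a log-softmax of the
    sum gives label posteriors.
    Blank path: as in HAT, a sigmoid on a separate blank logit, here driven
    by the encoder and a dedicated blank decoder.
    """

    def __init__(self, enc_dim: int, dec_dim: int, vocab_size: int):
        super().__init__()
        self.am_proj = nn.Linear(enc_dim, vocab_size)      # encoder -> AM scores
        self.ilm_proj = nn.Linear(dec_dim, vocab_size)     # label decoder -> internal LM scores
        self.blank_proj = nn.Linear(enc_dim + dec_dim, 1)  # -> blank logit

    def forward(self, enc_t, label_dec_u, blank_dec_u):
        """enc_t: (B, enc_dim); label_dec_u / blank_dec_u: (B, dec_dim)."""
        am_scores = self.am_proj(enc_t)                    # (B, V)
        ilm_scores = self.ilm_proj(label_dec_u)            # (B, V)
        label_logp = F.log_softmax(am_scores + ilm_scores, dim=-1)

        blank_logit = self.blank_proj(torch.cat([enc_t, blank_dec_u], dim=-1))  # (B, 1)
        blank_logp = F.logsigmoid(blank_logit)             # log P(blank)
        keep_logp = F.logsigmoid(-blank_logit)             # log (1 - P(blank))

        # HAT factorization: P(label k) = (1 - P(blank)) * softmax(AM + ILM)[k]
        return blank_logp, keep_logp + label_logp, ilm_scores


def ilm_loss(ilm_scores, next_labels):
    """Hypothetical internal LM loss: trains (label decoder + ilm_proj) as a
    standalone next-label predictor, so it can later be adapted on text only."""
    return F.cross_entropy(ilm_scores, next_labels)
```

Under this reading, the training objective would be the HAT loss plus a weighted `ilm_loss`, and text-only adaptation would update just the label decoder and `ilm_proj` as an ordinary neural LM, leaving the encoder and blank decoder untouched.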