Paper Title

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

Authors

Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

Abstract

In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched domains: for example, end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the non-streaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14.8%; when the streaming RNN-T model is trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67.0% to 25.3%. Finally, when trained on LibriSpeech, we find that dynamic overlapping inference improves WER on YouTube from 99.8% to 33.0%.
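To make the idea of overlapping inference concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: long audio is cut into fixed-length windows that overlap, each window is decoded independently, and the per-window word hypotheses are merged by letting each window "own" only the timestamps closest to its center, so that every word in an overlap region is kept exactly once. The `recognize` callable is hypothetical, standing in for any ASR decoder that returns timestamped words.

```python
def overlapping_inference(duration, recognize, win=8.0, hop=6.0):
    """Decode audio of `duration` seconds in overlapping windows and merge.

    `recognize(start, end)` is a hypothetical decoder returning a list of
    (word, absolute_time_in_seconds) pairs for the window [start, end).
    Consecutive windows overlap by (win - hop) seconds; each half of an
    overlap is assigned to the nearer window, so merged words are unique.
    """
    overlap = win - hop
    merged = []
    start = 0.0
    while start < duration:
        end = min(start + win, duration)
        # Timestamp range this window "owns": trim half the overlap on each
        # interior edge; the first and last windows keep their outer edge.
        left = start + overlap / 2 if start > 0 else 0.0
        right = end - overlap / 2 if end < duration else duration
        for word, t in recognize(start, end):
            if left <= t < right:
                merged.append(word)
        if end >= duration:
            break
        start += hop
    return merged
```

In the paper's dynamic variant, segment boundaries are additionally adjusted using the model's own alignment so that cuts avoid splitting words; the fixed-window version above shows only the segment-and-merge skeleton.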
