Paper Title

Streaming Joint Speech Recognition and Disfluency Detection

Paper Authors

Futami, Hayato, Tsunoo, Emiru, Shibata, Kentaro, Kashiwagi, Yosuke, Okuda, Takao, Arora, Siddhant, Watanabe, Shinji

Abstract

Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors and provide non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional disfluency tags. We propose a multi-task model to solve such problems, which has two output layers at the Transformer decoder; one for speech recognition and the other for disfluency detection. It is modeled to be conditioned on the currently recognized token with an additional token-dependency mechanism. We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency, on both the Switchboard and the corpus of spontaneous Japanese.
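The multi-task variant described above attaches two output layers to the Transformer decoder and conditions the disfluency prediction on the currently recognized token. A minimal sketch of that idea in PyTorch follows; it is not the authors' implementation, and all class names, dimensions, and the simple additive token-dependency mechanism are illustrative assumptions.

```python
# Hypothetical sketch of a multi-task decoder head: one linear layer for
# ASR token prediction and a second, binary disfluency layer conditioned
# on the recognized token (its embedding is added to the decoder state).
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.asr_out = nn.Linear(d_model, vocab_size)     # speech recognition output
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token-dependency embedding
        self.disf_out = nn.Linear(d_model, 2)             # fluent vs. disfluent

    def forward(self, dec_state):
        # dec_state: (batch, time, d_model) Transformer decoder states
        asr_logits = self.asr_out(dec_state)
        tokens = asr_logits.argmax(dim=-1)            # currently recognized tokens
        cond = dec_state + self.tok_emb(tokens)       # condition on recognized token
        disf_logits = self.disf_out(cond)
        return asr_logits, disf_logits

head = MultiTaskHead()
asr_logits, disf_logits = head(torch.randn(2, 5, 256))
```

Because both heads share the decoder states, the disfluency labels arrive in the same streaming pass as the transcript, which is the source of the low-latency, lightweight inference claimed in the abstract.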
