NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism

Paper Title

NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism

Authors

Jingbei Li, Yi Meng, Zhiyong Wu, Helen Meng, Qiao Tian, Yuping Wang, Yuxuan Wang

Abstract

Although deep learning and end-to-end models have been widely used and have shown superiority in automatic speech recognition (ASR) and text-to-speech (TTS) synthesis, state-of-the-art forced alignment (FA) models are still based on hidden Markov models (HMMs). HMMs have a limited view of contextual information and are developed with long pipelines, leading to error accumulation and unsatisfactory performance. Inspired by the capability of the attention mechanism to capture long-term contextual information and learn alignments in ASR and TTS, we propose a neural network based end-to-end forced aligner called NeuFA, in which a novel bidirectional attention mechanism plays an essential role. NeuFA integrates the alignment learning of both ASR and TTS tasks in a unified framework by learning bidirectional alignment information from a shared attention matrix in the proposed bidirectional attention mechanism. Alignments are extracted from the learnt attention weights and optimized by the ASR, TTS and FA tasks in a multi-task learning manner. Experimental results demonstrate the effectiveness of the proposed model: compared with the state-of-the-art HMM-based model, the mean absolute error on the test set drops from 25.8 ms to 23.7 ms at the word level and from 17.0 ms to 15.7 ms at the phoneme level.
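
The idea of reading two alignment directions from one shared attention matrix can be illustrated with a small sketch. This is not the authors' implementation. It assumes plain dot-product scoring between text-side and speech-side keys, which the abstract does not specify, and every name here (bidirectional_attention, text_keys, speech_keys, and so on) is hypothetical. The score matrix is computed once; a softmax over its rows gives the text-to-speech weights, and a softmax over its columns gives the speech-to-text weights.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def bidirectional_attention(text_keys, speech_keys, text_values, speech_values):
    # Shared score matrix: one row per text token, one column per speech frame.
    # Dot-product scoring is an assumption made for this sketch.
    scores = matmul(text_keys, transpose(speech_keys))
    # Text-to-speech direction: each token spreads its weight over the frames.
    w_text_to_speech = [softmax(row) for row in scores]
    # Speech-to-text direction: each frame spreads its weight over the tokens.
    w_speech_to_text = [softmax(row) for row in transpose(scores)]
    # Each side summarises the other side's values with its own weights.
    text_context = matmul(w_text_to_speech, speech_values)
    speech_context = matmul(w_speech_to_text, text_values)
    return scores, w_text_to_speech, w_speech_to_text, text_context, speech_context


# Toy input: 2 text tokens and 3 speech frames with 2-dimensional keys.
text_keys = [[1.0, 0.0], [0.0, 1.0]]
speech_keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_values = [[1.0], [2.0]]
speech_values = [[10.0], [20.0], [30.0]]

result = bidirectional_attention(text_keys, speech_keys, text_values, speech_values)
scores, w_t2s, w_s2t, text_context, speech_context = result
for token_index, weights in enumerate(w_t2s):
    print(token_index, [round(w, 3) for w in weights])
```

In this toy input the first token scores highest on the first two frames and the second token on the last frame, so both weight matrices point to the same token-frame pairing. Turning such weights into time boundaries, and training them jointly through the ASR, TTS and FA tasks, is the part the paper itself addresses and is not reproduced here.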
