Paper title
Multi-task Learning for Speaker Verification and Voice Trigger Detection
Paper authors
Paper abstract
Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch of the network is trained to label the input sequence with the correct label for the speaker. We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data for each task. We evaluate the speech transcription branch of the network on a voice trigger detection task while the speaker recognition branch is evaluated on a speaker verification task. Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations while yielding accuracies at least as good as the baseline models for each task, with the same number of parameters as the independent models.
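
As a rough illustration of the two training objectives named in the abstract, the sketch below computes a CTC loss for the phonetic branch and a classification loss for the speaker branch, then combines them. It is not the authors' implementation. The abstract does not say how the two losses are combined or which loss the speaker branch uses, so the weighted sum, the `weight` parameter, the cross-entropy speaker loss, and all function names here are assumptions made for illustration. The network that would produce the per-frame phone log-probabilities and the speaker logits is omitted.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of phone indices, without blanks.
    """
    # Extended label sequence: blank, t1, blank, t2, ..., blank.
    ext = [blank]
    for s in target:
        ext += [s, blank]
    S = len(ext)

    # alpha[s] = log-probability of all partial alignments ending in ext[s].
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            cands = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip over a blank
            new[s] = log_sum_exp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def speaker_loss(speaker_logits, speaker_id):
    """Cross-entropy of the correct speaker label (assumed speaker objective)."""
    return log_sum_exp(*speaker_logits) - speaker_logits[speaker_id]


def multitask_loss(log_probs, phones, speaker_logits, speaker_id, weight=1.0):
    """Assumed combination: CTC loss plus a weighted speaker loss."""
    return ctc_loss(log_probs, phones) + weight * speaker_loss(speaker_logits, speaker_id)
```

In a shared-network setup, both terms would be back-propagated into the common layers, which is what lets a single set of representations carry phonetic and speaker information. The relative weighting of the two terms is a tuning choice that the abstract does not specify.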