Convolutional Speech Recognition with Pitch and Voice Quality Features

Paper Title

Convolutional Speech Recognition with Pitch and Voice Quality Features

Authors

Guillermo Cámbara, Jordi Luque, Mireia Farrús

Abstract

This work studies the effects of adding pitch and voice quality features, such as jitter and shimmer, to a state-of-the-art CNN model for Automatic Speech Recognition. Pitch features have previously been used to improve classical HMM and DNN baselines, while jitter and shimmer parameters have proven useful for tasks such as speaker or emotion recognition. To the best of our knowledge, this is the first work to combine such pitch and voice quality features with modern convolutional architectures, showing relative WER improvements of up to 7% and 3% on the publicly available Spanish Common Voice and LibriSpeech 100h datasets, respectively. In particular, our work combines these features with mel-frequency spectral coefficients (MFSCs) to train a convolutional architecture with Gated Linear Units (Conv GLUs). Such models have been shown to yield low word error rates while being well suited to parallel processing for online streaming recognition use cases. We have added pitch and voice quality functionality to Facebook's wav2letter speech recognition framework, and we provide the code and recipes to the community to support further experiments. In addition, to the best of our knowledge, our Spanish Common Voice recipe is the first public Spanish recipe for wav2letter.
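
For readers unfamiliar with the voice quality features named in the abstract, the following is a minimal sketch of the commonly used "local" definitions of jitter and shimmer: the mean absolute difference between consecutive glottal cycles, divided by the mean value, applied to cycle periods (jitter) and cycle peak amplitudes (shimmer). It assumes the periods and amplitudes have already been extracted by a pitch tracker. The function names and sample values are hypothetical, and this is not necessarily the extraction procedure used by the authors or by wav2letter.

```python
from typing import Sequence


def _relative_perturbation(values: Sequence[float]) -> float:
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        raise ValueError("mean value must be non-zero")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / mean_value


def local_jitter(periods: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the pitch period (assumed input: seconds per cycle)."""
    return _relative_perturbation(periods)


def local_shimmer(amplitudes: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the peak amplitude (assumed input: linear amplitude)."""
    return _relative_perturbation(amplitudes)


if __name__ == "__main__":
    # Hypothetical values for four consecutive glottal cycles.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.78, 0.82, 0.79]
    print(f"jitter:  {local_jitter(periods):.4f}")
    print(f"shimmer: {local_shimmer(amplitudes):.4f}")
```

A perfectly periodic signal gives zero for both measures; larger values indicate more irregular voicing, which is the property that makes these features informative for speaker and emotion recognition tasks.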

Paper Title