Paper title
Adversarial synthesis based data-augmentation for code-switched spoken language identification
Paper authors
Paper abstract
Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR) that is used to classify the language(s) in an audio segment. Automatic LID plays a useful role in multilingual countries, where identifying a language becomes hard because two or more languages are mixed together during conversation. This speech phenomenon is called code-mixing or code-switching, and it occurs not only in India but also in many other Asian countries. Such code-mixed data is hard to find, which further limits the capabilities of spoken LID systems. Hence, this work primarily addresses the data scarcity of the code-switched class using data augmentation. The study focuses on an Indic language code-mixed with English: spoken LID is performed on Hindi code-mixed with English. This research proposes a Generative Adversarial Network (GAN) based data-augmentation technique that operates on Mel spectrograms of the audio data. GANs have already been shown to represent real data distributions accurately in the image domain; the proposed research exploits these capabilities in speech tasks such as speech classification and automatic speech recognition. GANs are trained to generate Mel spectrograms of the minority code-mixed class, which are then used to augment the training data for the classifier. Utilizing GANs gives an overall improvement of 3.5% in Unweighted Average Recall compared to a Convolutional Recurrent Neural Network (CRNN) classifier used as the baseline reference.
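The augmentation step the abstract describes, topping up the minority code-switched class with GAN-generated Mel spectrograms before training the classifier, can be sketched as below. This is a minimal NumPy sketch, not the paper's implementation: the spectrogram dimensions, the class sizes, and the `generate_spectrograms` stand-in for the trained GAN generator are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed Mel-spectrogram shape (n_mels x frames); the paper does not state exact dimensions.
N_MELS, FRAMES = 64, 128

# Toy imbalanced dataset:
#   label 0 = monolingual Hindi (majority)
#   label 1 = Hindi-English code-switched (minority)
real_major = rng.normal(size=(500, N_MELS, FRAMES))
real_minor = rng.normal(size=(120, N_MELS, FRAMES))

def generate_spectrograms(n):
    """Hypothetical stand-in for the trained GAN generator.

    In the paper's setting this would sample the generator; here it
    just returns noise with the right spectrogram shape.
    """
    return rng.normal(size=(n, N_MELS, FRAMES))

# Augment the minority class with GAN samples until the classes are balanced.
deficit = len(real_major) - len(real_minor)
fake_minor = generate_spectrograms(deficit)

X = np.concatenate([real_major, real_minor, fake_minor])
y = np.concatenate([np.zeros(len(real_major)),
                    np.ones(len(real_minor) + len(fake_minor))])

print(int((y == 0).sum()), int((y == 1).sum()))  # prints "500 500"
```

The balanced `(X, y)` set would then feed the CRNN classifier; reporting Unweighted Average Recall (the mean of per-class recalls) rather than accuracy is what makes the minority-class improvement visible under this kind of imbalance.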