CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Paper title

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Authors

Chen, Yu-Wen; Hung, Kuo-Hsuan; Li, You-Jin; Kang, Alexander Chao-Fu; Lai, Ya-Hsin; Liu, Kai-Chun; Fu, Szu-Wei; Wang, Syu-Siang; Tsao, Yu

Abstract

This study presents CITISEN, a deep learning-based speech signal-processing mobile application. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC). Together they allow CITISEN to serve as a platform for using and evaluating SE models and for flexibly extending those models to handle different noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to reduce the noise components in live or saved recordings provided by users. When CITISEN encounters unseen noise or speaker environments, the MA function is applied to improve its performance: a few audio samples recorded in the noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noise with an SE model and then mixes the processed speech with new background noise. The novel BNC function can be used to evaluate SE performance under specific conditions, to conceal the speaker's actual surroundings, and for entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals improved by about 6% in short-time objective intelligibility (STOI) and about 33% in perceptual evaluation of speech quality (PESQ). With MA, STOI and PESQ improved further, by approximately 6% and 11%, respectively. Finally, the BNC experiments indicated that speech signals converted from noisy backgrounds and from silent backgrounds yield comparable scene identification accuracy and similar embeddings in an acoustic scene classification model. The proposed BNC can therefore effectively convert the background noise of a speech signal and can serve as a data augmentation method when clean speech signals are unavailable.
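
The two-step BNC procedure described in the abstract (remove the original background with an SE model, then mix in a new background) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `mix_at_snr` and `convert_background`, the `enhance` callable standing in for the SE model, and the use of a target signal-to-noise ratio to set the level of the new background are all assumptions made for illustration; the abstract does not say how the mixing level is chosen.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the given SNR in dB.

    Hypothetical helper: the abstract does not specify a mixing rule.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the speech, then trim to length.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must both have non-zero power")

    # Choose the gain so that speech_power / (gain**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def convert_background(noisy_speech, new_noise, enhance, snr_db=5.0):
    """Sketch of BNC: enhance first, then mix with a new background.

    `enhance` is a placeholder for any SE model exposed as a callable
    that maps a waveform to a waveform (an assumption, not CITISEN's API).
    """
    enhanced = enhance(noisy_speech)
    return mix_at_snr(enhanced, new_noise, snr_db)
```

In practice `enhance` would wrap the pretrained or adapted SE model; passing an identity function (`lambda x: x`) is enough to exercise the mixing step on its own.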

