Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation

Paper Title

Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation

Authors

Yang, Ziye; Guan, Shanzheng; Zhang, Xiao-Lei

Abstract

Research on ad-hoc microphone arrays with deep learning has recently drawn much attention, especially for speech enhancement and separation. Because an ad-hoc microphone array may cover an area so large that multiple speakers are located far apart and talk independently, target-dependent speech separation, which aims to extract a target speaker from mixed speech, is important for extracting and tracking a specific speaker within the array. However, this technique has not yet been explored. In this paper, we propose deep ad-hoc beamforming based on speaker extraction, which is, to our knowledge, the first work on target-dependent speech separation based on ad-hoc microphone arrays and deep learning. The algorithm contains three components. First, we propose a supervised channel selection framework based on speaker extraction, in which the estimated utterance-level SNRs of the target speech serve as the basis for channel selection. Second, we apply the selected channels to a deep-learning-based MVDR algorithm, in which a single-channel speaker extraction algorithm is applied to each selected channel to estimate the mask of the target speech. We conducted extensive experiments on the WSJ0-adhoc corpus. Experimental results demonstrate the effectiveness of the proposed method.
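The two components described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything not stated in the abstract is an assumption: the function names `select_channels` and `mask_based_mvdr` are hypothetical, the selection rule is a simple top-k over estimated utterance-level SNRs, the per-channel masks are averaged across channels, and the MVDR filter uses the common covariance-based form w = (Φn⁻¹ Φs) u / trace(Φn⁻¹ Φs). The SNR estimates and masks are taken as given inputs; in the paper they would come from the speaker extraction network.

```python
import numpy as np


def select_channels(est_snr_db, k):
    """Return the sorted indices of the k channels with the highest estimated SNR.

    Assumption: a plain top-k rule; the abstract only says that estimated
    utterance-level SNRs are the basis for selection.
    """
    order = np.argsort(est_snr_db)[::-1]  # channel indices, descending SNR
    return np.sort(order[:k])


def mask_based_mvdr(stft, masks, ref=0, eps=1e-8):
    """Mask-based MVDR over the selected channels (illustrative sketch).

    stft  : complex array, shape (C, F, T), STFT of each selected channel
    masks : real array in [0, 1], shape (C, F, T), target-speech mask per channel
    ref   : index of the reference channel
    Returns the beamformed STFT, shape (F, T).
    """
    C, F, T = stft.shape
    m_s = masks.mean(axis=0)  # assumption: average the masks across channels
    m_n = 1.0 - m_s           # complementary mask for interference and noise
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Y = stft[:, f, :]  # (C, T)
        # Mask-weighted spatial covariance matrices of speech and noise
        phi_s = (m_s[f] * Y) @ Y.conj().T / (m_s[f].sum() + eps)
        phi_n = (m_n[f] * Y) @ Y.conj().T / (m_n[f].sum() + eps)
        phi_n = phi_n + eps * np.eye(C)  # diagonal loading for invertibility
        num = np.linalg.solve(phi_n, phi_s)  # phi_n^{-1} phi_s
        w = num[:, ref] / (np.trace(num) + eps)  # MVDR weights, shape (C,)
        out[f] = w.conj() @ Y  # w^H y for every frame
    return out
```

A typical use under these assumptions would be to call `select_channels` on the per-channel SNR estimates, index the multichannel STFT and masks with the returned indices, and pass both to `mask_based_mvdr`.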
