Paper Title

Cross-Modal Mutual Learning for Cued Speech Recognition

Paper Authors

Lei Liu, Li Liu

Paper Abstract

Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communications, where the Cued Speech (CS) system utilizes lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often utilize direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities (i.e., lip, hand shape, and hand position) in CS may cause interference for feature concatenation. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to prompt multi-modal interaction. Compared with the vanilla self-attention, our model forces modality-specific information of different modalities to pass through a modality-invariant codebook, collating linguistic representations for tokens of each modality. Then the shared linguistic knowledge is used to re-synchronize multi-modal sequences. Moreover, we establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese. To our knowledge, this is the first work on ACSR for Mandarin Chinese. Extensive experiments are conducted for different languages (i.e., Chinese, French, and British English). Results demonstrate that our model exhibits superior recognition performance to the state-of-the-art by a large margin.
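To make the core idea more concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of passing modality-specific tokens through a shared, modality-invariant codebook via cross-attention. Module names, dimensions, and the final fusion step are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch: tokens from each modality (lip, hand shape, hand position)
# attend to one shared, learnable codebook, yielding modality-invariant
# linguistic representations that can then be fused for recognition.

import torch
import torch.nn as nn


class CodebookCrossAttention(nn.Module):
    """Cross-attention where queries are modality tokens and keys/values
    come from a shared (modality-invariant) codebook."""

    def __init__(self, dim: int = 256, codebook_size: int = 64, num_heads: int = 4):
        super().__init__()
        # One set of learnable "linguistic anchor" embeddings shared by all modalities.
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) for a single modality.
        batch = tokens.size(0)
        book = self.codebook.unsqueeze(0).expand(batch, -1, -1)  # (batch, K, dim)
        # Each modality token is re-expressed as a mixture of shared codebook entries.
        out, _ = self.attn(query=tokens, key=book, value=book)
        return self.norm(tokens + out)


if __name__ == "__main__":
    dim = 256
    shared = CodebookCrossAttention(dim)           # same codebook used by every modality
    lip = torch.randn(2, 120, dim)                 # lip tokens, e.g. 120 video frames
    hand_shape = torch.randn(2, 80, dim)           # asynchronous: fewer hand frames
    hand_pos = torch.randn(2, 80, dim)

    # Project every modality through the shared codebook, then fuse (naively here).
    aligned = [shared(x) for x in (lip, hand_shape, hand_pos)]
    fused = torch.cat(aligned, dim=1)
    print(fused.shape)                             # torch.Size([2, 280, 256])
```

The design intent this sketch captures is that, instead of concatenating raw modality features directly, each modality is first grounded in the same discrete-like linguistic space, which is what allows the asynchronous lip and hand streams to be re-synchronized.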
