Paper Title

Transferring Cross-domain Knowledge for Video Sign Language Recognition

Authors

Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li

Abstract

Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data needs expert knowledge, thus limiting WSLR dataset acquisition. On the contrary, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a large domain gap from isolated signs, they cannot be directly used for training WSLR models. We observe that despite the existence of a large domain gap, isolated and news signs share the same visual concepts, such as hand gestures and body movements. Motivated by this observation, we propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news signs to them. To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align the features of these two domains. In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs. We then design a temporal attention based on the learnt descriptors to improve recognition performance. Experimental results on standard WSLR datasets show that our method outperforms previous state-of-the-art methods significantly. We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 for AP@0.5.
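The memory-and-attention idea above can be sketched in a few lines: per-frame features are scored against class centroids kept in an external memory, and a softmax over time turns those scores into attention weights for pooling a video-level descriptor. This is a minimal illustration, not the authors' exact formulation; the function name, shapes, and the max-over-centroids scoring are all assumptions.

```python
import numpy as np

def temporal_attention(frame_feats, centroids):
    """Hypothetical sketch of centroid-guided temporal attention.

    frame_feats: (T, D) per-frame features from a base WSLR model
    centroids:   (C, D) external memory of class centroids
                 computed from aligned news signs
    Returns a (D,) attention-pooled video descriptor.
    """
    # Cosine similarity between each frame and each stored centroid
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    m = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    # Score each frame by its best-matching centroid (an assumption)
    scores = (f @ m.T).max(axis=1)            # shape (T,)
    # Softmax over time yields the temporal attention weights
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Weighted sum of frame features -> video-level descriptor
    return w @ frame_feats                    # shape (D,)
```

Frames resembling a stored class centroid receive larger weights, so domain-invariant, sign-relevant frames dominate the pooled descriptor while uninformative frames are suppressed.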
