Paper Title
ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals
Paper Authors
Paper Abstract
Recently, there has been a surge of interest in applying Deep Learning (DL) models to autonomously perform hand gesture recognition using surface Electromyogram (sEMG) signals. DL models, however, are mainly designed to be applied to sparse sEMG signals. Furthermore, due to their complex structure, they typically face memory constraints, require long training times and large numbers of training samples, and make it necessary to resort to data augmentation and/or transfer learning. In this paper, for the first time (to the best of our knowledge), we investigate and design a Vision Transformer (ViT) based architecture to perform hand gesture recognition from High Density surface EMG (HD-sEMG) signals. Intuitively speaking, we capitalize on the recent breakthrough role of the transformer architecture in tackling different complex problems, together with its potential for greater input parallelization via its attention mechanism. The proposed Vision Transformer-based Hand Gesture Recognition (ViT-HGR) framework can overcome the aforementioned training-time problems and can accurately classify a large number of hand gestures from scratch without any need for data augmentation and/or transfer learning. The efficiency of the proposed ViT-HGR framework is evaluated using a recently released HD-sEMG dataset consisting of 65 isometric hand gestures. Our experiments with a 64-sample (31.25 ms) window size yield an average test accuracy of 84.62 ± 3.07%, using only 78,210 parameters. The compact structure of the proposed ViT-HGR framework (i.e., its significantly reduced number of trainable parameters) shows great potential for practical prosthetic control applications.
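The abstract does not spell out the ViT-HGR architecture details (patching scheme, embedding size, depth, or head count), so the sketch below is only a hypothetical illustration of the general idea in PyTorch: each 64-sample HD-sEMG window is treated as a sequence of per-time-step tokens and classified with a small transformer encoder. The channel count (128), embedding dimension, depth, and class-token design are assumptions for illustration, not the authors' configuration.

import torch
import torch.nn as nn

class TinyViTHGR(nn.Module):
    """Hypothetical sketch of a small ViT-style classifier for HD-sEMG windows."""
    def __init__(self, n_channels=128, window=64, dim=64, heads=8, depth=1, n_classes=65):
        super().__init__()
        # Each time sample (a vector over all electrodes) becomes one token ("patch").
        self.patch_embed = nn.Linear(n_channels, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, window + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)                        # 65 gesture classes

    def forward(self, x):                       # x: (batch, window, n_channels)
        tokens = self.patch_embed(x)            # (batch, window, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])   # classify from the class token

model = TinyViTHGR()
# A 64-sample window covering 31.25 ms implies a 2048 Hz sampling rate (64 / 0.03125 s).
windows = torch.randn(2, 64, 128)               # two illustrative HD-sEMG windows
logits = model(windows)
print(logits.shape)                             # torch.Size([2, 65])

The per-time-step tokenization above is just one plausible way to form a token sequence from an HD-sEMG window; spatial patching over the electrode grid would be an equally reasonable alternative.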