AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation

Paper Title

AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation

Authors

Xu Cao, Xiaoye Li, Liya Ma, Yi Huang, Xuan Feng, Zening Chen, Hongwu Zeng, Jianguo Cao

Abstract

Movement and pose assessment of newborns lets experienced pediatricians predict neurodevelopmental disorders, allowing early intervention for related diseases. However, most recent AI approaches to human pose estimation focus on adults, and public benchmarks for infant pose estimation are lacking. In this paper, we fill this gap by proposing an infant pose dataset and a Deep Aggregation Vision Transformer for human pose estimation, which introduces a fast-training, fully transformer-based framework that does not use convolution operations to extract features in the early stages. It generalizes Transformer + MLP to high-resolution deep layer aggregation within feature maps, thus enabling information fusion between different vision levels. We pre-train AggPose on the COCO pose dataset and apply it to our newly released large-scale infant pose estimation dataset. The results show that AggPose can effectively learn multi-scale features across different resolutions and significantly improve the performance of infant pose estimation. We show that AggPose outperforms the hybrid models HRFormer and TokenPose on the infant pose estimation dataset. Moreover, AggPose outperforms HRFormer by 0.8 AP on average on COCO val pose estimation. Our code is available at github.com/SZAR-LAB/AggPose.
