Paper title
Facial Expression Analysis Using Decomposed Multiscale Spatiotemporal Networks
Paper authors
Paper abstract
Video-based analysis of facial expressions is increasingly used to infer individuals' health states, such as depression and pain. Among existing approaches, deep learning models built from multiscale spatiotemporal processing structures have shown strong potential for encoding facial dynamics. However, such models have high computational complexity, which makes them difficult to deploy. To address this issue, we introduce a new technique that decomposes the extraction of multiscale spatiotemporal features. In particular, we present a building block called the Decomposed Multiscale Spatiotemporal Network (DMSN), along with three variants: the DMSN-A, DMSN-B, and DMSN-C blocks. The DMSN-A block generates multiscale representations by analyzing spatiotemporal features at multiple temporal ranges, the DMSN-B block analyzes spatiotemporal features at multiple ranges, and the DMSN-C block analyzes spatiotemporal features at multiple spatial sizes. Using these variants, we design our DMSN architecture, which can explore a variety of multiscale spatiotemporal features and thereby adapt to different facial behaviors. Our extensive experiments on challenging datasets show that the DMSN-C block is effective for depression detection, whereas the DMSN-A block is efficient for pain estimation. The results also indicate that our DMSN architecture provides a cost-effective solution for expressions ranging from those with fewer facial variations over time, as in depression detection, to those with greater variations, as in pain estimation.
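The abstract does not say how the decomposition is carried out, so the following is only an illustrative sketch of why decomposing multiscale spatiotemporal feature extraction can lower cost. It assumes one common form of decomposition, in which a full t x k x k 3D convolution is split into a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal convolution, and it assumes parallel branches over several temporal ranges. The channel counts, kernel sizes, temporal ranges, and function names are all hypothetical and are not taken from the paper.

```python
def conv3d_params(c_in, c_out, t, k):
    """Weights in a full 3D convolution with a t x k x k kernel (bias ignored)."""
    return c_in * c_out * t * k * k


def factorized_params(c_in, c_out, t, k):
    """Weights when the kernel is split into a 1 x k x k spatial convolution
    followed by a t x 1 x 1 temporal convolution (assumed decomposition)."""
    spatial = c_in * c_out * k * k
    temporal = c_out * c_out * t
    return spatial + temporal


def multiscale_params(count_fn, c_in, c_out, temporal_ranges, k):
    """Sum the weights of parallel branches, one branch per temporal range."""
    return sum(count_fn(c_in, c_out, t, k) for t in temporal_ranges)


# Hypothetical sizes, chosen only for illustration.
C_IN, C_OUT, K = 64, 64, 3
TEMPORAL_RANGES = (3, 5, 7)

full = multiscale_params(conv3d_params, C_IN, C_OUT, TEMPORAL_RANGES, K)
split = multiscale_params(factorized_params, C_IN, C_OUT, TEMPORAL_RANGES, K)

print(f"full 3D branches:    {full} weights")
print(f"factorized branches: {split} weights")
print(f"reduction:           {full / split:.2f}x")
```

Under these assumed sizes the factorized branches need roughly a third of the weights of the full 3D branches, and the gap widens as the temporal ranges grow. The actual DMSN blocks may decompose and share computation differently.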