Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning

Paper Title

Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning

Authors

Enrique Sanchez, Adrian Bulat, Anestis Zaganidis, Georgios Tzimiropoulos

Abstract

This paper tackles the challenging problem of estimating the intensity of Facial Action Units (AUs) from few labeled images. Contrary to previous works, our method does not require manual selection of key frames, and produces state-of-the-art results with as little as $2\%$ of annotated frames, which are \textit{randomly chosen}. To this end, we propose a semi-supervised learning approach in which a spatio-temporal model combining a feature extractor and a temporal module is learned in two stages. The first stage uses datasets of unlabeled videos to learn a strong spatio-temporal representation of facial behavior dynamics based on contrastive learning. To our knowledge, we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating AU intensity. We show that, although backpropagation through time is applied only with respect to the output of the network for extremely sparse and randomly chosen labeled frames, our model can be trained effectively to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing methods when working with as little as $2\%$ of randomly chosen data on both the DISFA and BP4D datasets, without the careful choice of labeled frames, a time-consuming task still required by previous approaches.
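
To make the two stages more concrete, the sketch below illustrates the kind of objectives such a pipeline could use. It is not the authors' implementation: the abstract does not state the exact losses, so the InfoNCE form of the contrastive loss, the temperature value, the mean-squared-error regression loss, and all function names (`info_nce`, `masked_mse`, `sample_labeled_mask`) are assumptions chosen for illustration. Only the idea of contrastive pre-training followed by supervision on roughly 2% of randomly chosen frames comes from the abstract.

```python
import math
import random


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (assumed form, not taken from the paper).

    Row i of `anchors` should match row i of `positives`; every other
    row in the batch acts as a negative.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Scaled cosine similarity between anchor i and every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # cross-entropy with target i
    return total / len(a)


def sample_labeled_mask(num_frames, fraction=0.02, seed=0):
    """Randomly mark about `fraction` of the frames as labeled (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(num_frames * fraction))
    chosen = set(rng.sample(range(num_frames), k))
    return [i in chosen for i in range(num_frames)]


def masked_mse(predictions, labels, labeled_mask):
    """Mean squared error over labeled frames only (assumed regression loss).

    Unlabeled frames contribute nothing, so the supervised signal
    comes solely from the sparse labeled frames.
    """
    errors = [(p - y) ** 2
              for p, y, m in zip(predictions, labels, labeled_mask) if m]
    return sum(errors) / len(errors) if errors else 0.0
```

In a full system, the embeddings passed to `info_nce` would come from the spatio-temporal model applied to unlabeled video, and `masked_mse` would be computed on the regressor's per-frame AU intensity predictions, with the mask produced by `sample_labeled_mask` selecting the small labeled subset.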
