Paper Title
AU-Aware Vision Transformers for Biased Facial Expression Recognition
Paper Authors
Paper Abstract
Studies have shown that domain bias and label bias exist across different Facial Expression Recognition (FER) datasets, making it hard to improve the performance on a specific dataset by adding other datasets. Regarding the FER bias issue, recent research mainly focuses on the cross-domain problem with advanced domain adaptation algorithms. This paper addresses a different problem: how to boost FER performance by leveraging cross-domain datasets. Unlike coarse and biased expression labels, facial Action Units (AUs) are fine-grained and objective, as suggested by psychological studies. Motivated by this, we resort to the AU information of different FER datasets for performance boosting and make the following contributions. First, we experimentally show that naive joint training on multiple FER datasets is harmful to the FER performance of individual datasets. We further introduce expression-specific mean images and AU cosine distances to measure FER dataset bias. This novel measurement is consistent with the performance degradation observed under joint training. Second, we propose a simple yet conceptually new framework, the AU-aware Vision Transformer (AU-ViT). It improves the performance on individual datasets by jointly training with auxiliary datasets carrying AU or pseudo-AU labels. We also find that AU-ViT is robust to real-world occlusions. Moreover, for the first time, we demonstrate that a carefully initialized ViT achieves performance comparable to advanced deep convolutional networks. Our AU-ViT achieves state-of-the-art performance on three popular datasets, namely 91.10% on RAF-DB, 65.59% on AffectNet, and 90.15% on FERPlus. The code and models will be released soon.
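The abstract describes measuring FER dataset bias with expression-specific AU cosine distances but does not give the exact formulation. The following is a minimal sketch of one plausible reading: for each expression class, average the AU activation vectors of each dataset and compare the two means with cosine distance. The function names (`mean_au_vector`, `au_cosine_distance`) and the array layouts are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed formulation) of an AU-based dataset-bias measure.
import numpy as np

def mean_au_vector(au_labels: np.ndarray, expr_labels: np.ndarray, expr_class: int) -> np.ndarray:
    """Average AU activation vector over all samples of one expression class.

    au_labels:   (N, num_aus) array of AU activations (binary or pseudo-AU probabilities).
    expr_labels: (N,) array of expression class indices.
    """
    mask = expr_labels == expr_class
    return au_labels[mask].mean(axis=0)

def au_cosine_distance(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine distance (1 - cosine similarity) between two mean AU vectors."""
    cos_sim = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-8)
    return 1.0 - cos_sim

# Hypothetical usage: compare the same expression class (e.g., class index 3)
# across two FER datasets; a larger distance suggests larger AU-level bias.
# dist = au_cosine_distance(
#     mean_au_vector(aus_dataset_a, exprs_dataset_a, expr_class=3),
#     mean_au_vector(aus_dataset_b, exprs_dataset_b, expr_class=3),
# )
```

Under this reading, a per-class distance close to zero would indicate that the two datasets express that emotion with similar AU patterns, while larger distances would align with the degradation the paper reports for naive joint training.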