Paper Title

Domain Adaptive Hand Keypoint and Pixel Localization in the Wild

Paper Authors

Takehiko Ohkawa, Yu-Jhe Li, Qichen Fu, Ryosuke Furuta, Kris M. Kitani, Yoichi Sato

Paper Abstract

We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that models trained for both tasks work under various imaging conditions. However, the variation covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt a model trained on labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of the target image for both tasks. These predictions come from two separate networks, and their divergence helps identify noisy predictions. To integrate the proposed confidence estimation into self-training, we present a teacher-student framework in which the two networks (teachers) provide supervision to a third network (student) for self-training, and the teachers in turn learn from the student via knowledge distillation. Our experiments show that this approach outperforms state-of-the-art methods in adaptation settings with different lighting, grasped objects, backgrounds, and camera viewpoints. Our method improves the multi-task score on HO3D by 4% over the latest adversarial adaptation method. We also validate our method on Ego4D, whose egocentric videos exhibit rapid changes in imaging conditions outdoors.
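The core algorithmic idea in the abstract is to use the disagreement (divergence) between two teacher networks as an inverse confidence that down-weights noisy pseudo-labels in the student's self-training loss. Below is a minimal PyTorch sketch of that weighting, under stated assumptions rather than the authors' implementation: the squared-error divergence, the exp(-divergence) confidence mapping, the averaged pseudo-label, and the name `confidence_weighted_loss` are all illustrative choices, and the paper's knowledge-distillation update of the teachers from the student is omitted.

```python
import torch

# Minimal sketch (illustrative assumptions, not the paper's exact formulation):
# predictions are keypoint heatmaps or mask logits of shape (B, C, H, W).
def confidence_weighted_loss(student_out, teacher_a_out, teacher_b_out):
    # Per-image divergence between the two teachers' predictions.
    divergence = ((teacher_a_out - teacher_b_out) ** 2).mean(dim=(1, 2, 3))  # (B,)
    # Map divergence to a confidence in (0, 1]: agreement -> high confidence.
    confidence = torch.exp(-divergence).detach()                             # (B,)
    # Pseudo-label for the student: average of the two teachers' outputs.
    pseudo_label = ((teacher_a_out + teacher_b_out) / 2).detach()
    # Per-image self-training loss, weighted by the estimated confidence,
    # so noisy (high-divergence) target images contribute less.
    per_image = ((student_out - pseudo_label) ** 2).mean(dim=(1, 2, 3))      # (B,)
    return (confidence * per_image).mean()

# Toy usage: batch of 2 images, 21 keypoint heatmaps of size 64x64.
s = torch.randn(2, 21, 64, 64)
t1, t2 = torch.randn(2, 21, 64, 64), torch.randn(2, 21, 64, 64)
loss = confidence_weighted_loss(s, t1, t2)
```

The `.detach()` calls reflect the one-way supervision described in the abstract: during this step the teachers only provide targets and confidence weights, and gradients flow into the student alone.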
