Paper Title

Video Mask Transfiner for High-Quality Video Instance Segmentation

Authors

Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu

Abstract

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Our VMT detects and groups the sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach based on iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, YouTube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the efficacy of our method in segmenting complex and dynamic objects by capturing precise details.
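
To make the sparse refinement idea above concrete, the following is a minimal sketch, not the paper's actual detection module: it scores each spatio-temporal point of a tracklet by boundary uncertainty and frame-to-frame disagreement, and keeps only the top-k points for refinement. The scoring heuristic, function name, and tensor shapes are all illustrative assumptions.

import torch

def select_error_prone_points(coarse_logits: torch.Tensor, k: int = 1024) -> torch.Tensor:
    """Pick the k most error-prone (t, y, x) points of one tracklet.

    coarse_logits: assumed per-frame mask logits of shape (T, H, W).
    """
    probs = coarse_logits.sigmoid()
    # Boundary uncertainty: highest where the mask probability is near 0.5.
    boundary_unc = 1.0 - (probs - 0.5).abs() * 2.0
    # Temporal inconsistency: disagreement with the neighbouring frames.
    diff = (probs[1:] - probs[:-1]).abs()
    temporal_unc = torch.zeros_like(probs)
    temporal_unc[1:] += diff
    temporal_unc[:-1] += diff
    # Combine both cues and keep only the k highest-scoring points.
    score = (boundary_unc + temporal_unc).flatten()
    topk = score.topk(min(k, score.numel())).indices
    T, H, W = probs.shape
    t = torch.div(topk, H * W, rounding_mode="floor")
    rem = topk % (H * W)
    return torch.stack([t, torch.div(rem, W, rounding_mode="floor"), rem % W], dim=1)

Only such a sparse point set would then be passed, together with fine-grained high-resolution features, to the video transformer for refinement with local and instance-level cues; operating on sparse points rather than dense feature maps is what keeps a high-resolution video transformer tractable.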
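
The annotation-refinement loop ("iterative training and self-correction") can be sketched in the same spirit. Here train_model and predict_track_logits are hypothetical placeholders, and the confidence-gated update rule is an assumed self-correction criterion; the paper builds this loop on the VMT architecture itself.

import torch

def refine_annotations(videos, annotations, rounds: int = 3, conf: float = 0.9):
    """Iteratively retrain on the current annotations and self-correct them.

    annotations: assumed dict mapping a track key -> bool mask tensor (T, H, W).
    train_model / predict_track_logits: hypothetical placeholders.
    """
    for _ in range(rounds):
        model = train_model(videos, annotations)                # hypothetical trainer
        for key, mask in annotations.items():
            probs = predict_track_logits(model, key).sigmoid()  # hypothetical inference
            confident = (probs > conf) | (probs < 1.0 - conf)
            # Self-correction: adopt the prediction where the model is
            # highly confident, keep the original coarse label elsewhere.
            annotations[key] = torch.where(confident, probs > 0.5, mask)
    return annotations

Each round sharpens the training targets along object boundaries, which is the mechanism the abstract relies on to produce the automatically refined HQ-YTVIS training data.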
