On the Viability of Monocular Depth Pre-training for Semantic Segmentation

Paper title

On the Viability of Monocular Depth Pre-training for Semantic Segmentation

Authors

Dong Lao, Fengyu Yang, Daniel Wang, Hyoungseob Park, Samuel Lu, Alex Wong, Stefano Soatto

Abstract

The question of whether pre-training on geometric tasks is viable for downstream transfer to semantic tasks is important for two reasons, one practical and the other scientific. If the answer is positive, we may be able to reduce pre-training cost and bias from human annotators significantly. If the answer is negative, it may shed light on the role of embodiment in the emergence of language and other cognitive functions in evolutionary history. To frame the question in a way that is testable with current means, we pre-train a model on a geometric task, and test whether that can be used to prime a notion of 'object' that enables inference of semantics as soon as symbols (labels) are assigned. We choose monocular depth prediction as the geometric task, and semantic segmentation as the downstream semantic task, and design a collection of empirical tests by exploring different forms of supervision, training pipelines, and data sources for both depth pre-training and semantic fine-tuning. We find that monocular depth is a viable form of pre-training for semantic segmentation, validated by improvements over common baselines. Based on the findings, we propose several possible mechanisms behind the improvements, including their relation to dataset size, resolution, architecture, in/out-of-domain source data, and validate them through a wide range of ablation studies. We also find that optical flow, which at first glance may seem as good as depth prediction since it optimizes the same photometric reprojection error, is considerably less effective, as it does not explicitly aim to infer the latent structure of the scene, but rather the raw phenomenology of temporally adjacent images.
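The closing comparison between depth and optical flow can be made concrete with a generic form of the photometric reprojection error. The abstract does not give the loss, so the notation here is an assumption rather than the authors' exact objective: $I_t$ and $I_s$ are a target and a temporally adjacent source image, $\Omega$ is the pixel domain, $\bar{x}$ is pixel $x$ in homogeneous coordinates, $K$ is the camera intrinsic matrix, $(R, T)$ is the relative camera pose, $\pi$ is perspective projection, $d(x)$ is the predicted depth, and $f(x)$ is the predicted flow.

```latex
% Depth: correspondence is induced by one scalar per pixel and a shared rigid motion
\mathcal{L}_{\text{depth}}
  = \frac{1}{|\Omega|} \sum_{x \in \Omega}
    \left| I_t(x) - I_s\!\left( \pi\!\left( K \left( R \, d(x) \, K^{-1} \bar{x} + T \right) \right) \right) \right|

% Optical flow: correspondence is a free 2D displacement per pixel
\mathcal{L}_{\text{flow}}
  = \frac{1}{|\Omega|} \sum_{x \in \Omega}
    \left| I_t(x) - I_s\!\left( x + f(x) \right) \right|
```

Both losses compare $I_t$ with a warped $I_s$ and differ only in how the warp is parameterized. In the depth case the warp must be explained by scene geometry, $d(x)$ together with $(R, T)$, whereas in the flow case any displacement field that matches pixel intensities is admissible. This is consistent with the abstract's remark that flow models the raw phenomenology of adjacent images rather than the latent structure of the scene.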
