Paper title
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
Paper authors
Abstract
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and does not scale. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to generate pseudo 3D object labels by cross-view consistency. We then fine-tune a pretrained language model using pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of the resulting VLN models. On the SPL metric, our approach improves over the state of the art by 7.1% and 8.1% on the unseen validation splits of the REVERIE and SOON datasets, respectively.
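For readers unfamiliar with the SPL metric quoted above, the following is a minimal sketch of the commonly used definition of Success weighted by Path Length: each successful episode scores the ratio of the shortest-path length to the larger of the shortest-path length and the length actually traveled, failed episodes score zero, and the scores are averaged. The function name `spl`, its argument names, and the handling of zero-length episodes are assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    successes[i]        -- whether episode i reached its goal
    shortest_lengths[i] -- shortest-path length from start to goal
    path_lengths[i]     -- length of the path the agent actually took
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("need at least one episode")

    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not ok:
            continue  # a failed episode contributes zero
        denom = max(taken, shortest)
        # Assumed convention: a successful zero-length episode counts as 1.0.
        total += shortest / denom if denom > 0 else 1.0
    return total / len(successes)


# Three hypothetical episodes: an optimal success, a success that took
# twice the shortest path, and a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(score)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Because the denominator takes the larger of the two lengths, an agent cannot score above 1.0 on an episode, and detours lower the score even when the goal is reached.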