Paper Title
What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning
Authors
Abstract
Understanding spatial relations is essential for intelligent agents to act and communicate in the physical world. Relative directions are spatial relations that describe the relative positions of target objects with regard to the intrinsic orientation of reference objects. Grounding relative directions is more difficult than grounding absolute directions because it requires a model not only to detect objects in the image and identify spatial relations based on this information, but also to recognize the orientation of objects and integrate this information into the reasoning process. We investigate the challenging problem of grounding relative directions with end-to-end neural networks. To this end, we provide GRiD-3D, a novel dataset that features relative directions and complements existing visual question answering (VQA) datasets, such as CLEVR, that involve only absolute directions. We also provide baselines for the dataset with two established end-to-end VQA models. Experimental evaluations show that answering questions on relative directions is feasible when questions in the dataset simulate the necessary subtasks for grounding relative directions. We discover that those subtasks are learned in an order that reflects the steps of an intuitive pipeline for processing relative directions.