Paper Title
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Authors

Georgios Tziafas, Hamidreza Kasaei

Abstract
In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through the utilization of a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, as well as arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), therefore marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user's vocabulary, while being transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, both in simulation and with a real robot. We make our datasets available at https://gtziafas.github.io/neurosymbolic-manipulation.
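The program-of-primitives idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the scene dictionary, primitive names, and queries below are all assumptions, and the neural primitives (e.g. visual grounding) are stubbed out as plain attribute lookups.

```python
# Hypothetical sketch of composing primitive skills into an executable
# program, as described in the abstract. NOT the authors' code: scene
# format and primitive names are illustrative assumptions, and neural
# primitives (e.g. visual grounding) are replaced by symbolic lookups.

# Toy scene: each object carries attributes that neural modules would
# normally predict from the 3D observation.
scene = [
    {"id": 0, "category": "cube", "color": "red"},
    {"id": 1, "category": "cube", "color": "blue"},
    {"id": 2, "category": "bowl", "color": "red"},
]

# Primitive skills. In the paper some are symbolic (e.g. counting) and
# others are trainable neural functions; here everything is symbolic.
def filter_color(objects, color):
    return [o for o in objects if o["color"] == color]

def filter_category(objects, category):
    return [o for o in objects if o["category"] == category]

def count(objects):
    return len(objects)

def unique(objects):
    # A referring expression must resolve to exactly one object.
    assert len(objects) == 1, "ambiguous referring expression"
    return objects[0]

# VQA query "How many red objects are there?" parsed into a program:
answer = count(filter_color(scene, "red"))  # -> 2

# REF query "the red cube", whose result would feed a grasp primitive:
target = unique(filter_category(filter_color(scene, "red"), "cube"))
print(answer, target["id"])  # prints "2 0"
```

Because the parser emits an explicit chain of named primitive calls, every intermediate result can be inspected, which is where the interpretability benefit claimed in the abstract comes from.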