Paper Title
Transformers for One-Shot Visual Imitation

Paper Authors

Sudeep Dasari, Abhinav Gupta

Paper Abstract

Humans are able to seamlessly visually imitate others, by inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate that into concrete motor control. Is it possible to give a robot this same capability? Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators. However, expanding these techniques to work with a single positive example during test time is still an open challenge. Apart from control, the difficulty stems from mismatches between the demonstrator and robot domains. For example, objects may be placed in different locations (e.g. kitchen layouts are different in every house). Additionally, the demonstration may come from an agent with different morphology and physical appearance (e.g. human), so one-to-one action correspondences are not available. This paper investigates techniques which allow robots to partially bridge these domain gaps, using their past experience. A neural network is trained to mimic ground truth robot actions given context video from another agent, and must generalize to unseen task instances when prompted with new videos during test time. We hypothesize that our policy representations must be both context driven and dynamics aware in order to perform these tasks. These assumptions are baked into the neural network using the Transformers attention mechanism and a self-supervised inverse dynamics loss. Finally, we experimentally determine that our method accomplishes a $\sim 2$x improvement in terms of task success rate over prior baselines in a suite of one-shot manipulation tasks.
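The abstract describes a policy that is context driven (attending over demonstrator video features via Transformer attention) and dynamics aware (trained with a self-supervised inverse dynamics loss). The sketch below is a minimal, illustrative rendering of those two ingredients, not the paper's actual architecture: the embedding shapes, the single-query attention, and the MSE inverse-dynamics loss are all assumptions chosen for brevity.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # Transformer attention: weight context features by their similarity
    # to the query, then return the weighted sum (and the weights).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def inverse_dynamics_loss(pred_action, true_action):
    # Self-supervised auxiliary objective: predict the action that links
    # consecutive states, here scored with a simple MSE (an assumption).
    return np.mean((pred_action - true_action) ** 2)

# Hypothetical shapes: T demonstrator frames embedded to d dims, one query.
rng = np.random.default_rng(0)
d, T = 8, 5
context = rng.normal(size=(T, d))   # embeddings of demonstrator video frames
state = rng.normal(size=(1, d))     # embedding of the robot's current observation

attended, weights = scaled_dot_product_attention(state, context, context)
loss = inverse_dynamics_loss(rng.normal(size=4), rng.normal(size=4))
```

In the paper's setting these attended features would condition the action head of the policy, while the inverse-dynamics loss shapes the state embeddings without requiring action labels from the demonstrator.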