Paper Title

Verbal Focus-of-Attention System for Learning-from-Observation

Authors

Wake, Naoki, Yanokura, Iori, Sasabuchi, Kazuhiro, Ikeuchi, Katsushi

Abstract

The learning-from-observation (LfO) framework aims to map human demonstrations to a robot to reduce programming effort. To this end, an LfO system encodes a human demonstration into a series of execution units for a robot, which are referred to as task models. Although previous research has proposed successful task-model encoders, there has been little discussion on how to guide a task-model encoder in a scene with spatio-temporal noises, such as cluttered objects or unrelated human body movements. Inspired by the function of verbal instructions guiding an observer's visual attention, we propose a verbal focus-of-attention (FoA) system (i.e., spatio-temporal filters) to guide a task-model encoder. For object manipulation, the system first recognizes the name of a target object and its attributes from verbal instructions. The information serves as a where-to-look FoA filter to confine the areas in which the target object existed in the demonstration. The system then detects the timings of grasp and release that occurred in the filtered areas. The timings serve as a when-to-look FoA filter to confine the period of object manipulation. Finally, a task-model encoder recognizes the task models by employing FoA filters. We demonstrate the robustness of the verbal FoA in attenuating spatio-temporal noises by comparing it with an existing action localization network. The contributions of this study are as follows: (1) to propose a verbal FoA for LfO, (2) to design an algorithm to calculate FoA filters from verbal input, and (3) to demonstrate the effectiveness of a verbal FoA in localizing an action by comparing it with a state-of-the-art vision system.
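The two-stage FoA pipeline described in the abstract — a spatial "where-to-look" filter keyed to the verbally named object, followed by a temporal "when-to-look" filter bounded by grasp and release events — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the data structures and function names (`Detection`, `where_to_look`, `when_to_look`) are assumptions for clarity.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    frame: int                       # frame index in the demonstration video
    label: str                       # recognized object name
    box: Tuple[int, int, int, int]   # (x, y, w, h) bounding box

def where_to_look(detections: List[Detection], target: str) -> List[Detection]:
    """Spatial FoA: keep only detections of the verbally named target object,
    discarding clutter (unrelated objects in the scene)."""
    return [d for d in detections if d.label == target]

def when_to_look(grasps: List[int], releases: List[int]) -> Tuple[int, int]:
    """Temporal FoA: confine the demonstration to the manipulation period,
    from the first grasp to the last release in the filtered areas."""
    return min(grasps), max(releases)

# Toy demonstration: a cluttered scene containing an unrelated object ("mug").
dets = [
    Detection(0, "mug", (5, 5, 20, 20)),
    Detection(3, "bottle", (40, 10, 15, 30)),
    Detection(7, "bottle", (42, 12, 15, 30)),
]
focused = where_to_look(dets, "bottle")   # verbal instruction named "bottle"
start, end = when_to_look([3], [7])       # grasp at frame 3, release at frame 7
print([d.frame for d in focused])  # [3, 7]
print((start, end))                # (3, 7)
```

The filtered detections and the `(start, end)` window would then be handed to the task-model encoder, which only needs to reason over the confined spatio-temporal region.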
