Paper Title
TarGF: Learning Target Gradient Field to Rearrange Objects without Explicit Goal Specification
Paper Authors
Paper Abstract
Object rearrangement is the task of moving objects from an initial state to a goal state. Here, we focus on a more practical setting of object rearrangement, i.e., rearranging objects from shuffled layouts toward a normative target distribution without explicit goal specification. This setting remains challenging for AI agents, as it is hard to describe the target distribution (goal specification) for reward engineering or to collect expert trajectories as demonstrations. Hence, it is infeasible to directly employ reinforcement learning or imitation learning algorithms for this task. This paper aims to search for a policy using only a set of examples from the target distribution instead of a handcrafted reward function. We employ a score-matching objective to train a Target Gradient Field (TarGF), which indicates, for each object, a direction that increases the likelihood of the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) for model-based planning, we can cast the target gradient into a reference control and output actions with a distributed path planner; 2) for model-free reinforcement learning, the TarGF not only estimates the likelihood change as a reward but also provides suggested actions for residual policy learning. Experimental results on ball and room rearrangement demonstrate that our method significantly outperforms state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability.
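The abstract describes two reusable ideas: learning a gradient field over example layouts via (denoising) score matching, and turning that field into a dense reward and suggested actions. The following is a minimal PyTorch sketch of both ideas, not the authors' released implementation; the `ScoreNet` architecture, the noise schedule, and the flattened 2D "ball positions" state are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of:
# (1) training a target gradient field with denoising score matching on examples
#     drawn from the target distribution, and
# (2) deriving a likelihood-change reward and suggested actions from the field.
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Predicts the gradient of the log-density w.r.t. a flattened object state."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, x, sigma):
        # Condition on the noise level so one network covers all noise scales.
        return self.net(torch.cat([x, sigma], dim=-1))

def dsm_loss(score_net, x0, sigmas):
    """Denoising score matching: for x = x0 + sigma * eps, the target score of the
    Gaussian-perturbed density is -(x - x0) / sigma**2 = -eps / sigma."""
    sigma = sigmas[torch.randint(len(sigmas), (x0.shape[0], 1))]
    eps = torch.randn_like(x0)
    x = x0 + sigma * eps
    target = -eps / sigma
    pred = score_net(x, sigma)
    return ((pred - target) ** 2).sum(dim=-1).mean()

def likelihood_change_reward(score_net, s, s_next, sigma_min):
    """First-order estimate of the change in log-likelihood between two states,
    usable as a dense reward: grad log p(s) . (s_next - s)."""
    sigma = torch.full((s.shape[0], 1), sigma_min)
    with torch.no_grad():
        grad = score_net(s, sigma)
    return (grad * (s_next - s)).sum(dim=-1)

# Usage sketch: x0 holds flattened example layouts from the target distribution.
if __name__ == "__main__":
    state_dim = 2 * 10                          # e.g. 10 balls with (x, y) positions (assumed)
    score_net = ScoreNet(state_dim)
    opt = torch.optim.Adam(score_net.parameters(), lr=1e-4)
    sigmas = torch.logspace(0, -2, steps=10)    # geometric noise schedule (assumed)
    x0 = torch.rand(64, state_dim)              # placeholder target examples
    opt.zero_grad()
    loss = dsm_loss(score_net, x0, sigmas)
    loss.backward()
    opt.step()
    # The predicted gradient at a small sigma can also act as a suggested
    # per-object velocity, e.g. as the base action in residual policy learning
    # or as a reference control for a planner.
```

Conditioning the network on the noise level and evaluating it at the smallest sigma at test time is one common way to read out a gradient field from a score-matching model; the paper's actual network, noise schedule, and reward normalization may differ.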