Paper title
Imitation Learning via Differentiable Physics
Authors
Abstract
Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process that alternates between learning a reward function and a policy, and they tend to suffer from long training times and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, Imitation Learning via Differentiable Physics (ILD), which removes the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. ILD incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, minimizes the distance between the expert trajectory and the agent trajectory, and back-propagates the gradient into the policy through the temporal physics operators. With the physics prior, ILD policies not only transfer to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves stability and training speed. To simplify the complex optimization landscape induced by temporal physics operations, ILD dynamically selects the learning objective for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods on a variety of continuous control tasks in Brax while requiring only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and generalizes to unseen configurations.
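The single-loop structure described in the abstract can be illustrated with a minimal JAX sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy point-mass simulator `step` stands in for a differentiable physics engine such as Brax, `policy` is a hypothetical linear controller, and `matching_loss` uses a symmetric nearest-neighbor (Chamfer-style) distance as one plausible way to pick a target expert state for each agent state. The abstract does not specify the actual distance or selection rule.

```python
import jax
import jax.numpy as jnp

DT = 0.1  # integration step of the toy simulator (assumed value)

def step(state, action):
    """Toy differentiable dynamics: 2-D point mass, state = [pos(2), vel(2)]."""
    pos, vel = state[:2], state[2:]
    vel = vel + DT * action
    pos = pos + DT * vel
    return jnp.concatenate([pos, vel])

def policy(params, state):
    """Hypothetical linear policy with tanh-bounded actions."""
    return jnp.tanh(params["W"] @ state + params["b"])

def rollout(params, init_state, horizon):
    """Unroll policy and dynamics; returns states of shape (horizon, 4)."""
    def body(state, _):
        nxt = step(state, policy(params, state))
        return nxt, nxt
    _, states = jax.lax.scan(body, init_state, None, length=horizon)
    return states

def matching_loss(params, init_state, expert_states):
    """Symmetric nearest-neighbor distance between agent and expert states.

    Assumption: each agent state is matched to its closest expert state,
    and each expert state to its closest agent state.
    """
    agent_states = rollout(params, init_state, expert_states.shape[0])
    # Pairwise squared distances, shape (T_agent, T_expert).
    diff = agent_states[:, None, :] - expert_states[None, :, :]
    d2 = jnp.sum(diff ** 2, axis=-1)
    agent_to_expert = jnp.mean(jnp.min(d2, axis=1))
    expert_to_agent = jnp.mean(jnp.min(d2, axis=0))
    return agent_to_expert + expert_to_agent

@jax.jit
def update(params, init_state, expert_states, lr):
    """One gradient step; the gradient flows back through the simulator."""
    loss, grads = jax.value_and_grad(matching_loss)(
        params, init_state, expert_states)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

# A single expert demonstration, generated by a hand-written PD-like
# controller that drives the point mass toward the goal (1, 1).
expert_params = {
    "W": jnp.concatenate([-2.0 * jnp.eye(2), -2.0 * jnp.eye(2)], axis=1),
    "b": jnp.array([2.0, 2.0]),
}
init_state = jnp.zeros(4)
expert_states = rollout(expert_params, init_state, 20)

# Single training loop: no reward function is learned in between.
params = {"W": jnp.zeros((2, 4)), "b": jnp.zeros(2)}
losses = []
for _ in range(200):
    params, loss = update(params, init_state, expert_states, 0.02)
    losses.append(float(loss))

print("initial loss:", losses[0], "final loss:", losses[-1])
```

In this sketch the simulator is part of the computational graph, so one call to `jax.value_and_grad` yields policy gradients directly from the trajectory distance. Swapping the toy `step` for a real differentiable simulator would keep the same loop shape, though long horizons and contact dynamics would likely need more careful optimization than plain gradient descent.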