Paper Title
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
Paper Authors
Paper Abstract
奖励和表示学习是从感官观察中学习一组不断扩展的机器人操纵技巧的两个长期挑战。鉴于内域,特定于任务的机器人数据的固有成本和稀缺性,从大型,多样化的离线视频中学习,已经成为获得一个普遍有用的视觉表示控制的有前途的途径;但是,如何将这些人类视频用于通用奖励学习仍然是一个悬而未决的问题。我们介绍了$ \ textbf {v} $ alue-$ \ textbf {i} $ mplitic $ \ textbf {p} $ re-training(vip),这是一种自我培训的预训练的预训练的预培训的视觉表示,能够生成未看见的机器人任务的密集奖励和平滑的奖励函数。 VIP演员表代表从人类视频中学习是一个离线目标条件的强化学习问题,并得出了一个自制的双目标条件条件的价值功能目标,该目标不取决于行动,从而可以对未标记的人类视频进行预训练。从理论上讲,可以将VIP理解为一个新颖的隐式时间对比目标,该目标产生了时间平滑的嵌入,从而使值函数能够通过嵌入距离隐式定义,然后可以将其用于为任何指定的下游任务的目标图像构建奖励。 VIP的冷冻代表性在大规模的EGO4D人类视频中接受了培训,并且无需进行任何微调,而无需对任务特定的数据进行微调,可以为一组广泛的模拟和$ \ textbf {Real-bot} $任务提供密集的视觉奖励,从而实现了基于奖励的多样性控制方法,并实现了所有先前的先前先前的预先表现。值得注意的是,VIP可以启用简单的,$ \ textbf {fig-shot} $ offline rl rl在一组现实世界的机器人任务中,只有20个轨迹。
Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
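To make the "value function implicitly defined via the embedding distance" idea concrete, below is a minimal sketch of how a frozen VIP-style encoder could produce a dense reward for a goal-image task. The encoder `DummyEncoder`, the helper names, and the exact reward form (the change in negative embedding distance between consecutive frames, following the abstract's description) are illustrative assumptions for this sketch, not the released VIP implementation; the paper trains its encoder on Ego4D.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for a frozen visual encoder; any module mapping
# images -> embedding vectors fits this sketch.
class DummyEncoder(nn.Module):
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def goal_value(phi: nn.Module, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    """Value implicitly defined as the negative L2 distance between the
    observation embedding and the goal-image embedding."""
    return -torch.norm(phi(obs) - phi(goal), dim=-1)


def vip_reward(phi: nn.Module, obs: torch.Tensor,
               next_obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    """Dense reward: the change in embedding-distance value between two
    consecutive frames, positive when the transition moves toward the goal."""
    return goal_value(phi, next_obs, goal) - goal_value(phi, obs, goal)


if __name__ == "__main__":
    phi = DummyEncoder().eval()  # frozen: no fine-tuning, matching the paper's setup
    obs, next_obs, goal = (torch.rand(1, 3, 224, 224) for _ in range(3))
    with torch.no_grad():
        print(vip_reward(phi, obs, next_obs, goal))
```

Because the reward is a difference of values, every transition that brings the embedding closer to the goal image is rewarded, yielding a dense, temporally smooth signal that downstream reward-based control methods can consume without any task-specific data.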