Paper Title

Learning Invariable Semantical Representation from Language for Extensible Policy Generalization

Paper Authors

Yihan Li, Jinsheng Ren, Tianrun Xu, Tianren Zhang, Haichuan Gao, Feng Chen

Paper Abstract

Recently, incorporating natural language instructions into reinforcement learning (RL) to learn semantically meaningful representations and foster generalization has attracted much attention. However, the semantic information in language instructions is usually entangled with task-specific state information, which hampers the learning of semantically invariant and reusable representations. In this paper, we propose a method, called element randomization, to learn such representations: it extracts task-relevant but environment-agnostic semantics from instructions using a set of environments with randomized elements, e.g., topological structures or textures, that share the same language instruction. We theoretically prove the feasibility of learning semantically invariant representations through randomization. In practice, we accordingly develop a hierarchy of policies, where a high-level policy modulates the behavior of a goal-conditioned low-level policy by proposing subgoals that serve as semantically invariant representations. Experiments on challenging long-horizon tasks show that (1) our low-level policy reliably generalizes across environment changes; (2) our hierarchical policy exhibits extensible generalization on unseen new tasks that can be decomposed into several solvable sub-tasks; and (3) by storing and replaying language trajectories as succinct policy representations, the agent can complete tasks in a one-shot fashion, i.e., once one successful trajectory has been attained.
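The abstract describes two components: element randomization (the same instruction paired with several environments whose elements, such as topology or textures, are randomized) and a hierarchy in which a high-level policy proposes subgoals to a goal-conditioned low-level policy. Below is a minimal PyTorch sketch of one way the invariance idea could be instantiated; the SubgoalEncoder module, the variance penalty, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubgoalEncoder(nn.Module):
    # Hypothetical high-level head: maps (state, instruction embedding)
    # to a subgoal z that conditions the low-level policy.
    def __init__(self, state_dim: int, instr_dim: int, z_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + instr_dim, 128),
            nn.ReLU(),
            nn.Linear(128, z_dim),
        )

    def forward(self, state: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, instr], dim=-1))

def invariance_loss(encoder: SubgoalEncoder,
                    states: torch.Tensor,
                    instr: torch.Tensor) -> torch.Tensor:
    # states: (K, state_dim), one state per environment whose elements
    # (topology, texture, ...) were randomized; instr: (1, instr_dim),
    # the shared instruction. Penalizing the spread of the K subgoals
    # pushes the encoder toward an environment-agnostic representation.
    z = encoder(states, instr.expand(states.size(0), -1))  # (K, z_dim)
    return ((z - z.mean(dim=0, keepdim=True)) ** 2).mean()

# Toy usage with random stand-ins for env states and an instruction embedding.
enc = SubgoalEncoder(state_dim=16, instr_dim=32, z_dim=8)
states = torch.randn(4, 16)   # 4 element-randomized environments
instr = torch.randn(1, 32)    # one shared language instruction
loss = invariance_loss(enc, states, instr)
loss.backward()  # in practice, combined with the hierarchical RL objective
```

The sketch isolates only the randomization-based invariance term; in the paper it would be trained jointly with the RL objectives of the high- and low-level policies.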
