Title
Integrating Contrastive Learning with Dynamic Models for Reinforcement Learning from Images
Authors
Abstract
Recent methods for reinforcement learning from images use auxiliary tasks to learn image features that are used by the agent's policy or Q-function. In particular, methods based on contrastive learning that induce linearity of the latent dynamics or invariance to data augmentation have been shown to greatly improve the sample efficiency of the reinforcement learning algorithm and the generalizability of the learned embedding. We further argue that explicitly improving the Markovianity of the learned embedding is desirable, and we propose a self-supervised representation learning method that integrates contrastive learning with dynamic models to synergistically combine three objectives: (1) we maximize the InfoNCE bound on the mutual information between the state and action embeddings and the embedding of the next state, inducing a linearly predictive embedding without explicitly learning a linear transition model; (2) we further improve the Markovianity of the learned embedding by explicitly learning a non-linear transition model via regression; and (3) we maximize the mutual information between the two non-linear predictions of the next embedding obtained from the current action and two independent augmentations of the current state, which naturally induces transformation invariance not only for the state embedding but also for the non-linear transition model. Experimental evaluation on the DeepMind Control Suite shows that our proposed method achieves higher sample efficiency and better generalization than state-of-the-art methods based on contrastive learning or reconstruction.
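
To make the three objectives concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the MLP encoder (the paper works on images with a convolutional encoder), the transition model, the linear map W used as the InfoNCE critic, all dimensions, and the equal weighting of the three losses are assumptions; details such as target networks or loss coefficients are not specified here.

import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    # InfoNCE lower bound on mutual information: the positive key for each
    # query is the key at the same batch index; all other keys are negatives.
    logits = queries @ keys.t() / temperature                # (B, B) similarities
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)

class Encoder(nn.Module):
    # Stand-in for the image encoder (assumption: flattened observations
    # replace the paper's CNN for brevity).
    def __init__(self, obs_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))
    def forward(self, x):
        return self.net(x)

class TransitionModel(nn.Module):
    # Non-linear latent transition model: predicts the next embedding
    # from the current embedding and action.
    def __init__(self, emb_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim + act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def total_loss(enc, trans, W, obs_aug1, obs_aug2, action, next_obs):
    # obs_aug1 / obs_aug2: two independent augmentations of the same observation.
    z1, z2 = enc(obs_aug1), enc(obs_aug2)
    z_next = enc(next_obs).detach()                          # target embedding

    # (1) InfoNCE between a linear function of (z, a) and z' induces a linearly
    #     predictive embedding; no explicit linear transition model is trained.
    loss_nce = info_nce(W(torch.cat([z1, action], dim=-1)), z_next)

    # (2) Regression onto the next embedding with the non-linear transition
    #     model, which improves Markovianity of the embedding.
    pred1, pred2 = trans(z1, action), trans(z2, action)
    loss_reg = F.mse_loss(pred1, z_next)

    # (3) InfoNCE between the two predictions from independent augmentations,
    #     inducing augmentation invariance for encoder and transition model.
    loss_aug = info_nce(pred1, pred2)

    return loss_nce + loss_reg + loss_aug                    # equal weights assumed

# Example usage with toy dimensions (all values hypothetical):
enc = Encoder(obs_dim=50, emb_dim=32)
trans = TransitionModel(emb_dim=32, act_dim=4)
W = nn.Linear(32 + 4, 32)
obs1, obs2 = torch.randn(8, 50), torch.randn(8, 50)
action, next_obs = torch.randn(8, 4), torch.randn(8, 50)
total_loss(enc, trans, W, obs1, obs2, action, next_obs).backward()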