Paper Title
PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations
Paper Authors
Paper Abstract
Deep Reinforcement Learning (DRL) has been a promising solution to many complex decision-making problems. Nevertheless, its notorious weakness in generalization across environments prevents the widespread application of DRL agents in real-world scenarios. Although advances have been made recently, most prior works assume sufficient online interaction with the training environments, which can be costly in practical cases. To this end, we focus on an offline-training-online-adaptation setting, in which the agent first learns from offline experiences collected in environments with different dynamics and then performs online policy adaptation in environments with new dynamics. In this paper, we propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation. In the offline training phase, the environment representation and policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete. With the learned representations, a Policy-Dynamics Value Function (PDVF) [Raileanu et al., 2020] network is trained to approximate the values of different combinations of policies and environments from offline experiences. In the online adaptation phase, with the environment context inferred from a few experiences collected in the new environment, the policy is optimized by gradient ascent with respect to the PDVF. Our experiments show that PAnDR outperforms existing algorithms in several representative policy adaptation problems.
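To make the online adaptation step concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code) of optimizing a policy embedding by gradient ascent with respect to a PDVF-style value network, given an environment embedding inferred from a few transitions in the new environment. The network architecture, embedding dimensions, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PDVF(nn.Module):
    """Sketch of a value network V(state, policy embedding, env embedding),
    assumed to have been trained on offline experiences."""
    def __init__(self, state_dim: int, policy_dim: int, env_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + policy_dim + env_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, z_policy, z_env):
        return self.net(torch.cat([state, z_policy, z_env], dim=-1))

def adapt_policy_embedding(pdvf, init_state, z_env, policy_dim=8, steps=50, lr=0.1):
    """Gradient ascent on the policy embedding to maximize the predicted value
    under the (fixed) environment embedding inferred from new-environment data."""
    z_policy = torch.zeros(1, policy_dim, requires_grad=True)
    opt = torch.optim.Adam([z_policy], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        value = pdvf(init_state, z_policy, z_env)
        (-value).backward()  # ascend on the value by descending on its negation
        opt.step()
    return z_policy.detach()

# Illustrative usage with random tensors standing in for real data.
state_dim, policy_dim, env_dim = 4, 8, 8
pdvf = PDVF(state_dim, policy_dim, env_dim)
init_state = torch.randn(1, state_dim)   # initial state in the new environment
z_env = torch.randn(1, env_dim)          # env embedding inferred from a few transitions
z_star = adapt_policy_embedding(pdvf, init_state, z_env, policy_dim)
```

In the full method, the optimized embedding would then be decoded into an executable policy; the sketch only illustrates how the value network turns policy adaptation into a gradient-based search over the learned policy representation space.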