Paper Title

Non-Parametric Stochastic Policy Gradient with Strategic Retreat for Non-Stationary Environment

Paper Authors

Apan Dastider, Mingjie Lin

Paper Abstract

In modern robotics, effectively computing optimal control policies under dynamically varying environments poses substantial challenges to off-the-shelf parametric policy gradient methods, such as the Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic policy gradient (TD3). In this paper, we propose a systematic methodology to dynamically learn a sequence of optimal control policies non-parametrically, while autonomously adapting to the constantly changing environment dynamics. Specifically, our non-parametric kernel-based methodology embeds a policy distribution as features in a non-decreasing Euclidean space, thereby allowing its search space to be defined as a very high (possibly infinite) dimensional RKHS (Reproducing Kernel Hilbert Space). Moreover, by leveraging the similarity metric computed in the RKHS, we augment our non-parametric learning with the technique of AdaptiveH: adaptively selecting a time-frame window for finishing the optimal part of the whole action sequence sampled at some preceding observed state. To validate our proposed approach, we conducted extensive experiments with multiple classic benchmarks and one simulated robotics benchmark equipped with dynamically changing environments. Overall, our methodology outperforms the well-established DDPG and TD3 methods by a sizeable margin in terms of learning performance.
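
The two mechanisms the abstract describes, an RKHS similarity between distributions and an adaptively shrinking execution window ("strategic retreat"), can be illustrated concretely. The following is a minimal sketch, not the authors' implementation: it uses the kernel Maximum Mean Discrepancy (MMD) with an RBF kernel as one standard choice of RKHS similarity, and a simple heuristic that commits to fewer actions of a pre-sampled sequence as the measured environment drift grows. All function names, the threshold, and the drift-to-horizon rule are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch (assumptions, not the paper's code) of two ideas from the
# abstract: (1) an RKHS similarity between two sample sets, computed here as
# the squared kernel MMD with an RBF kernel, and (2) an "AdaptiveH"-style rule
# that shortens the executed portion of a pre-sampled action sequence when
# the environment appears to have drifted.
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """RBF kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    diff = x[:, None, :] - y[None, :, :]           # pairwise differences
    sq_dists = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(samples_p, samples_q, bandwidth=1.0):
    """Biased estimate of squared MMD; small value = similar in the RKHS."""
    k_pp = rbf_kernel(samples_p, samples_p, bandwidth).mean()
    k_qq = rbf_kernel(samples_q, samples_q, bandwidth).mean()
    k_pq = rbf_kernel(samples_p, samples_q, bandwidth).mean()
    return k_pp + k_qq - 2.0 * k_pq

def adaptive_horizon(drift, max_horizon=10, threshold=0.5):
    """Hypothetical stand-in for AdaptiveH: the larger the drift (MMD), the
    fewer actions of the pre-sampled sequence are executed before
    re-planning, i.e., the policy "retreats" to a shorter horizon."""
    scaled = min(drift / threshold, 1.0)           # normalise drift to [0, 1]
    return max(1, int(round(max_horizon * (1.0 - scaled))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    old_states = rng.normal(0.0, 1.0, size=(64, 3))   # states observed earlier
    new_states = rng.normal(0.6, 1.0, size=(64, 3))   # drifted environment
    drift = mmd_squared(old_states, new_states)
    print(f"MMD^2 = {drift:.4f}, execute {adaptive_horizon(drift)} of 10 actions")
```

The design intuition this sketch captures is the one stated in the abstract: because the similarity is evaluated in the RKHS rather than in raw parameter space, distributional change can be detected without assuming a parametric form, and the detected change then controls how much of a previously sampled action sequence is still trustworthy to execute.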
