AACHER: Assorted Actor-Critic Deep Reinforcement Learning with Hindsight Experience Replay

Paper Title

AACHER: Assorted Actor-Critic Deep Reinforcement Learning with Hindsight Experience Replay

Authors

Adarsh Sehgal, Muskan Sehgal, Hung Manh La

Abstract

Actor learning and critic learning are the two components of Deep Deterministic Policy Gradient (DDPG), a prominent and widely used reinforcement learning method. Because both play a significant role in the robot's overall learning, the performance of DDPG is relatively sensitive and unstable. We propose a multi-actor-critic DDPG for reliable actor-critic learning to further enhance the performance and stability of DDPG. This multi-actor-critic DDPG is then integrated with Hindsight Experience Replay (HER) to form our new deep learning framework, AACHER. AACHER replaces the single actor or critic in DDPG with the average value of multiple actors or critics, which increases robustness when one actor or critic performs poorly. Numerous independent actors and critics can also gain knowledge from the environment more broadly. We implemented AACHER on goal-based environments: AuboReach, FetchReach-v1, FetchPush-v1, FetchSlide-v1, and FetchPickAndPlace-v1. In our experiments, we used various actor/critic combinations, among which A10C10 and A20C20 performed best. Overall, the results show that AACHER outperforms the traditional algorithm (DDPG+HER) for every actor/critic combination evaluated. On FetchPickAndPlace-v1, A20C20 reaches a success rate up to roughly 3.8 times that of DDPG+HER.
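
The averaging idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear actors and critics, the dimensions, and the names `averaged_action` and `averaged_q` are all hypothetical, chosen only to show how the mean over several independent actors (or critics) could stand in for a single network's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; "A3C3" follows the abstract's AxCy naming (3 actors, 3 critics).
OBS_DIM, ACT_DIM = 4, 2
N_ACTORS, N_CRITICS = 3, 3

# Stand-in actors: each is a linear map obs -> action, squashed by tanh.
actor_weights = [rng.normal(size=(ACT_DIM, OBS_DIM)) for _ in range(N_ACTORS)]
# Stand-in critics: each is a linear map over [obs, action] -> scalar Q-value.
critic_weights = [rng.normal(size=OBS_DIM + ACT_DIM) for _ in range(N_CRITICS)]


def actor_outputs(obs):
    """Return one action per actor, shape (N_ACTORS, ACT_DIM)."""
    return np.stack([np.tanh(w @ obs) for w in actor_weights])


def averaged_action(obs):
    """Average the actions proposed by all actors."""
    return actor_outputs(obs).mean(axis=0)


def critic_outputs(obs, action):
    """Return one Q-value estimate per critic, shape (N_CRITICS,)."""
    x = np.concatenate([obs, action])
    return np.array([w @ x for w in critic_weights])


def averaged_q(obs, action):
    """Average the Q-value estimates of all critics."""
    return critic_outputs(obs, action).mean()


obs = rng.normal(size=OBS_DIM)
action = averaged_action(obs)
q = averaged_q(obs, action)
print("averaged action:", action)
print("averaged Q:", q)
```

Because each output is a mean, a single poorly performing actor or critic shifts the result by only a fraction of its own error, which is the robustness argument the abstract makes. In a goal-based environment the observation would presumably also include the goal; that detail is omitted here.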
