Paper Title
Deep Reinforcement Learning for Complex Manipulation Tasks with Sparse Feedback
Paper Authors
Paper Abstract
Learning optimal policies from sparse feedback is a known challenge in reinforcement learning. Hindsight Experience Replay (HER) is a multi-goal reinforcement learning algorithm designed to solve such tasks. The algorithm treats every failure as a success for an alternative (virtual) goal that was achieved during the episode, and then generalizes from that virtual goal to real goals. HER has known flaws and is limited to relatively simple tasks. In this thesis, we present three algorithms, each based on the existing HER algorithm, that improve its performance. First, we prioritize virtual goals from which the agent will learn more valuable information. We call this property the \textit{instructiveness} of the virtual goal and define it by a heuristic measure that expresses how well the agent will be able to generalize from that virtual goal to actual goals. Second, we design a filtering process that detects and removes misleading samples that may induce bias throughout the learning process. Lastly, we enable the learning of complex, sequential tasks using a form of curriculum learning combined with HER. We call this algorithm \textit{Curriculum HER}. To test our algorithms, we built three challenging manipulation environments with sparse reward functions. Each environment has three levels of complexity. Our empirical results show a marked improvement in both final success rate and sample efficiency compared to the original HER algorithm.
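To make the relabeling idea in the abstract concrete, the following is a minimal sketch of baseline hindsight relabeling, not the thesis's own implementation. It assumes the "final" strategy (the virtual goal is whatever was achieved at the last step of the episode) and a sparse reward of 0 on success and -1 otherwise. The names `sparse_reward`, `her_relabel_final`, the transition dictionary keys, and the tolerance `tol` are all hypothetical and chosen for illustration only. The prioritization, filtering, and curriculum extensions described above are not shown, since the abstract does not define them.

```python
import numpy as np


def sparse_reward(achieved_goal, goal, tol=0.05):
    """Assumed sparse reward: 0.0 if the goal is reached within tol, else -1.0."""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal))
    return 0.0 if distance <= tol else -1.0


def her_relabel_final(episode, tol=0.05):
    """Relabel every transition with the goal achieved at the end of the episode.

    `episode` is a list of dicts (hypothetical layout) with at least the keys
    "achieved_goal" (goal-space state reached after the action) and "goal"
    (the goal the agent was actually pursuing).
    """
    virtual_goal = np.asarray(episode[-1]["achieved_goal"])
    relabeled = []
    for transition in episode:
        new_transition = dict(transition)  # shallow copy; original is untouched
        new_transition["goal"] = virtual_goal
        new_transition["reward"] = sparse_reward(
            transition["achieved_goal"], virtual_goal, tol
        )
        relabeled.append(new_transition)
    return relabeled


# A failed episode: the agent never gets near the real goal at (5, 5).
real_goal = np.array([5.0, 5.0])
episode = [
    {"achieved_goal": np.array([0.0, 0.0]), "goal": real_goal},
    {"achieved_goal": np.array([0.5, 0.0]), "goal": real_goal},
    {"achieved_goal": np.array([1.0, 0.0]), "goal": real_goal},
]
original_rewards = [sparse_reward(t["achieved_goal"], t["goal"]) for t in episode]
relabeled = her_relabel_final(episode)
relabeled_rewards = [t["reward"] for t in relabeled]
```

In this toy episode every original reward is -1, so the agent receives no learning signal. After relabeling, the final transition counts as a success for the virtual goal, which is the signal HER then relies on to generalize toward real goals.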